Fatal error when using addproc and rmproc #7646

Closed. colintbowers opened this issue on Jul 17, 2014 · 5 comments
Labels: parallel
Participants: @colintbowers, @ViralBShah, @amitmurthy, @jiahao

colintbowers commented on Jul 17, 2014:

Julia throws a fatal error when I add and remove processes (addprocs and rmprocs), but only if I don't do any parallel processing in between. I originally posted this on StackOverflow, thinking that I was doing something wrong, but a user suggested the problem was a bug and that I should file it here. For full detail, see my StackOverflow question at: http://stackoverflow.com/questions/24774706/julia-doesnt-like-it-when-i-add-and-remove-processes-without-doing-any-parallel

In short, if I run the following code:

    #Set parameters
    numCore = 4;

    #Add workers
    print("Adding workers... ");
    addprocs(numCore - 1);
    println(string(string(numCore-1), " workers added."));

    #Detect number of cores
    println(string("Number of processes detected = ", string(nprocs())));

    #Remove the additional workers
    print("Removing workers... ");
    rmprocs(workers());
    println("Done.");

    println("Subroutine complete.");

I get the following output (the error text from two workers is interleaved):

    Adding workers... 3 workers added.
    Number of processes detected = 4
    Removing workers... Done.
    Subroutine complete.
    fatal error on In [86]: fatal error on 88: ERROR: 87: ERROR: connect: connection refused (ECONNREFUSED)
     in yield at multi.jl:1540
    connect: connection refused (ECONNREFUSED)
     in wait at task.jl:117
     in wait_connected at stream.jl:263
     in connect at stream.jl:878
     in Worker at multi.jl:108
     in anonymous at task.jl:876
     in yield at multi.jl:1540
     in wait at task.jl:117
     in wait_connected at stream.jl:263
     in connect at stream.jl:878
     in Worker at multi.jl:108
     in anonymous at task.jl:876

which suggests my code ran to completion, then threw a fatal error.
However, I can get rid of the error by actually doing some parallel processing in between. That is, if I add the lines:

    # Do some stuff
    XLst = {rand(10, 1) for i in 1:8};
    XMean = pmap(mean, XLst);

to my above routine immediately after detecting the presence of 4 cores, then I don't get an error. Thanks.

jiahao added the parallel label on Jul 18, 2014
ViralBShah modified the milestone: 0.3 on Jul 18, 2014

ViralBShah (The Julia Language member) commented on Jul 18, 2014:

Cc: @amitmurthy @tanmaykm

ViralBShah (The Julia Language member) commented on Jul 18, 2014:

I see it if I run it all in one shot, but if I put some delay between the adding phase and the removal phase above, it doesn't crash. It seems like a race somewhere while freeing resources.

amitmurthy (The Julia Language member) commented on Jul 21, 2014:

Explanation:

pid 1 does an addprocs(3). addprocs returns after it has established connections with all 3 new workers. However, at this time the connections between workers may not have been set up, i.e. from pids 3 -> 2, 4 -> 2 and 4 -> 3.

Now pid 1 calls rmprocs(workers()), i.e., on pids 2, 3 and 4. As pid 2 exits, the connection attempt from 4 to 2 results in an error. Since we have redirected the output of pid 4 to the stdout of pid 1, we see the same error printed.

The system is still in a consistent state, though the printed error messages may suggest that something is amiss.

We could add a flag in the worker structure and stop printing the captured stdout once rmprocs has been called. But I am afraid it may end up hiding other real network issues that can crop up, and may make debugging a little more involved.

Since a use case of addprocs immediately followed by rmprocs is rare, I would suggest not fixing this for now.

ViralBShah (The Julia Language member) commented on Jul 21, 2014:

Can we add a barrier so that addprocs only returns when all the connections are set up?
That would seem like a reasonable thing to do.

amitmurthy (The Julia Language member) commented on Jul 21, 2014:

Yeah, that is probably the right thing to do. Have addprocs wait for an "initialization complete" message from the newly added workers before returning.

amitmurthy referenced this issue on Jul 24, 2014: "addprocs: wait till all workers are connected to each other." #7713 (merged)
JeffBezanson closed this in #7713 on Jul 25, 2014
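
The barrier discussed in this thread can also be approximated from user code. The following is only a minimal sketch and is not the change merged in #7713. It assumes a current Julia 1.x, where addprocs, rmprocs, workers, myid and remotecall_fetch live in the Distributed standard library (in 2014 they were in Base). The helper name wait_for_worker_mesh is hypothetical. It asks each new worker to complete a round trip to every other new worker, so the worker-to-worker connections exist before rmprocs is called.

```julia
using Distributed

# Hypothetical helper (not part of Julia): block until every worker in `pids`
# has completed a round trip to every other worker in `pids`.
function wait_for_worker_mesh(pids = workers())
    for p in pids
        peers = filter(q -> q != p, pids)
        # The do-block runs on worker p. Fetching myid() from each peer q
        # forces the p -> q connection to be established (or to fail loudly here).
        remotecall_fetch(p, peers) do qs
            for q in qs
                remotecall_fetch(myid, q)
            end
            return nothing
        end
    end
    return nothing
end

# Reproduction from the report, with the barrier between add and remove.
new_pids = addprocs(2)          # returns the ids of the new workers
wait_for_worker_mesh(new_pids)  # worker-to-worker links are now in use
rmprocs(new_pids)               # by default, waits until the workers are gone
```

Doing the round trips explicitly, rather than sleeping or running an unrelated pmap, makes the wait depend on the connections themselves instead of on timing.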