Fatal error when using addproc and rmproc #7646

Closed · colintbowers opened this Issue on Jul 17, 2014 · 5 comments

colintbowers commented on Jul 17, 2014

Julia throws a fatal error when I add and remove processes (addprocs and rmprocs), but only if I don't do any parallel processing in between. I originally posted this on StackOverflow, thinking that I was doing something wrong, but a user suggested the problem was a bug and that I should file it here. For full detail, see my StackOverflow question at:

http://stackoverflow.com/questions/24774706/julia-doesnt-like-it-when-i-add-and-remove-processes-without-doing-any-parallel

In short, if I run the following code:

    # Set parameters
    numCore = 4;
    # Add workers
    print("Adding workers... ");
    addprocs(numCore - 1);
    println(string(string(numCore-1), " workers added."));
    # Detect number of cores
    println(string("Number of processes detected ", string(nprocs())));
    # Remove the additional workers
    print("Removing workers... ");
    rmprocs(workers());
    println("Done.");
    println("Subroutine complete.");

I get the following output:

    Adding workers... 3 workers added.
    Number of processes detected 4
    Removing workers... Done.
    Subroutine complete.
    fatal error on In [86]: fatal error on 88: ERROR: 87: ERROR: connect: connection refused (ECONNREFUSED)
     in yield at multi.jl:1540
    connect: connection refused (ECONNREFUSED)
     in wait at task.jl:117
     in wait_connected at stream.jl:263
     in connect at stream.jl:878
     in Worker at multi.jl:108
     in anonymous at task.jl:876
     in yield at multi.jl:1540
     in wait at task.jl:117
     in wait_connected at stream.jl:263
     in connect at stream.jl:878
     in Worker at multi.jl:108
     in anonymous at task.jl:876

which suggests my code ran to completion, then threw a fatal error. However, I can get rid of the error by actually doing some parallel processing in between.
That is, if I add the lines:

    # Do some stuff
    XLst = [rand(10, 1) for i in 1:8];
    XMean = pmap(mean, XLst);

to my above routine immediately after detecting the presence of 4 cores, then I don't get an error. Thanks.

jiahao added the parallel label on Jul 18, 2014

ViralBShah modified the milestone: 0.3 on Jul 18, 2014

ViralBShah (The Julia Language member) commented on Jul 18, 2014

Cc: amitmurthy tanmaykm

ViralBShah commented on Jul 18, 2014

I see it if I run it all in one shot, but if I just have some delay between the adding phase and the removal phase above, it doesn't crash. Seems like some kind of a race somewhere while freeing resources.

amitmurthy (The Julia Language member) commented on Jul 21, 2014

Explanation:

- pid 1 does an addprocs(3).
- addprocs returns after it has established connections with all 3 new workers.
- However, at this time the connections between workers may not have been set up, i.e. from pids 3 -> 2, 4 -> 2 and 4 -> 3.
- Now pid 1 calls rmprocs(workers()), i.e., pids 2, 3 and 4.
- As pid 2 exits, the connection attempt from 4 to 2 results in an error.
- Since we have redirected the output of pid 4 to the stdout of pid 1, we see the same error printed.

The system is still in a consistent state, though the printing of said error messages may suggest something is amiss.

We could add a flag in the worker structure and stop printing the captured stdout once rmprocs has been called. But I am afraid it may end up hiding any other real network issues that can crop up, and may make debugging a little more involved.

Since a use case of addprocs immediately followed by an rmprocs is rare, I would suggest not fixing this for now.

ViralBShah commented on Jul 21, 2014

Can we add a barrier so that addprocs only returns when all the connections are set up? That would seem like a reasonable thing to do.

amitmurthy (The Julia Language member) commented on Jul 21, 2014

Yeah, that is probably the right thing to do. Have addprocs wait for an "initialization complete" message from the newly added workers before returning.

amitmurthy referenced this issue on Jul 24, 2014: Merged "addprocs: wait till all workers are connected to each other." #7713

JeffBezanson closed this in #7713 on Jul 25, 2014
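
For readers trying the add/remove sequence on a current Julia, here is a minimal sketch of the reporter's workaround (do some remote work between addprocs and rmprocs). It assumes Julia 1.x, where these functions live in the Distributed standard library; the 2014 code above predates that module. The helper name add_ping_remove is hypothetical. The round trip only mirrors the "do something in between" observation from the report: it does not by itself guarantee that the worker-to-worker connections exist, which is what the fix in #7713 addressed inside addprocs.

```julia
using Distributed  # assumption: Julia 1.x; not needed in the 2014-era code above

# Hypothetical helper: add n workers, make one round trip to each,
# then remove them again.
function add_ping_remove(n::Integer)
    pids = addprocs(n)                  # returns the ids of the new workers
    # One trivial remote call per worker; remotecall_fetch blocks until
    # the worker has answered with its own process id.
    replies = [remotecall_fetch(myid, p) for p in pids]
    rmprocs(pids)                       # by default waits until the workers are gone
    return replies == pids
end
```

Calling add_ping_remove(3) corresponds to the numCore = 4 case in the report; afterwards nprocs() should be back to 1.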