Issues starting many workers #9591 (Closed)

samuela opened this issue on Jan 3, 2015 · 20 comments

---

**samuela** commented on Jan 3, 2015

I have trouble starting more than 32 workers:

```
skainswo@smp014:~/julia$ ./julia -p 64
Master process (id 1) could not connect within 60.0 seconds. exiting.
Master process (id 1) could not connect within 60.0 seconds. exiting.
ERROR: stream is closed or unusable
 in check_open at stream.jl:294
 in write at stream.jl:730
 in send_msg_ at multi.jl:178
 in send_msg_now at multi.jl:137
 in add_workers at multi.jl:255
 in addprocs at multi.jl:1237
 in process_options at ./client.jl:236
 in _start at ./client.jl:354
Worker 2 terminated.Worker 3 terminated.
WARNING: Forcibly interrupting busy workers
Worker -1 terminated.Worker -1 terminated.
...
Master process (id 1) could not connect within 60.0 seconds. exiting.
Master process (id 1) could not connect within 60.0 seconds. exiting.
...
Master process (id 1) could not connect within 60.0 seconds. exiting.
```

Sometimes it works; most of the time it doesn't. I have the same issue with `addprocs`.

---

**amitmurthy** (The Julia Language member) commented on Jan 3, 2015

What is your machine configuration? A newly launched worker exits if the master process does not connect within 60 seconds; this is done to ensure that failed launches do not result in workers hanging around.

---

**samuela** commented on Jan 4, 2015

Well, it depends. I'm trying to run on a cluster which gives me different resources on each run, possibly across different machines. My guess is that cross-machine communication is slow for some reason. Is there a way to adjust the timeout?

---

**amitmurthy** (The Julia Language member) commented on Jan 4, 2015

Not currently, but since this issue crops up frequently, it would be a good idea to add one. I also think it would be a good idea to have `-p` without a number argument default to launching as many workers as `CPU_CORES`, which is probably what you want anyway.

---

*amitmurthy added the parallel label on Jan 4, 2015*

---

**amitmurthy** (The Julia Language member) commented on Jan 4, 2015

On my machine each additional Julia worker requires around 80MB without any packages or code being loaded. I suspect that on lower-spec'd machines the workers are being swapped out and hence timing out in your case. What type of cluster are you running on?

Based on your example, I think you are launching independent local Julia clusters on each machine. As a workaround you could add an `addprocs()` call, without any arguments, at the beginning of your script; this will just add as many workers as there are cores on that node. That is, run `julia myscript.jl` with the first line of `myscript.jl` being `addprocs()`.

---

**samuela** commented on Jan 4, 2015

I believe `addprocs()` wasn't introduced until after 0.3, so I tried adding `addprocs(Sys.CPU_CORES)` to the beginning of my script. No luck, unfortunately. `Sys.CPU_CORES` is 64 for me, so I think it's doing essentially the same thing.

---

**samuela** commented on Jan 4, 2015

Also, I'm running with 64G of memory, so I'm not sure why they'd be swapped out so quickly.

---

**amitmurthy** (The Julia Language member) commented on Jan 4, 2015

This has been optimized in 0.4, where connections to the workers are set up in parallel. But a means of overriding the default timeout values, through environment variables or other means, should be provided.
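For reference, a minimal sketch of the script-level workaround described above; the file name `myscript.jl` and the batch size are illustrative only, and the commented-out loop mirrors the addprocs-in-a-loop workaround mentioned later in this thread.

```julia
# myscript.jl -- run as `julia myscript.jl` instead of `julia -p 64`.
# On Julia 0.3/0.4 addprocs lives in Base; on 0.7+ you would need
# `using Distributed`, and Sys.CPU_CORES has become Sys.CPU_THREADS.

# Add one local worker per core (the zero-argument addprocs() form is not
# available on 0.3, so pass the count explicitly):
addprocs(Sys.CPU_CORES)

# Alternative workaround: add workers in smaller batches so that no single
# batch risks hitting the 60-second connect timeout.
# for _ in 1:8
#     addprocs(8)
# end

# Quick check that all workers are up and reachable:
@everywhere println("worker $(myid()) ready on $(gethostname())")
```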
---

**samuela** commented on Jan 4, 2015

Ah, ok, great! I'll try running with a more recent version then.

---

**ViralBShah** (The Julia Language member) commented on Jan 5, 2015

Is this something we can backport parts of?

---

**amitmurthy** (The Julia Language member) commented on Jan 5, 2015

I would prefer not to, since the ClusterManager interface has also changed. But the ability to override the default timeout values should be backportable.

---

*ihnorton added the backport pending label on Jan 8, 2015*

---

**tkelman** (The Julia Language member) commented on Mar 20, 2015

> But the ability to override default timeout values should be backportable.

Bump. Do we still want to do this?

---

**Keno** (The Julia Language member) commented on Mar 20, 2015

I say leave it; as a workaround you can probably use `addprocs` in a loop, so it's not worth the risk of backporting.

---

*tkelman removed the backport pending label on Mar 20, 2015*

---

**hfty** commented on May 29, 2015

I'm running into the same issue on the latest 0.4.0-dev+5035 between two Amazon EC2 instances. Trying to launch 36 workers at once times out. The workaround of using a for loop to launch 12 at a time works, however.

---

**amitmurthy** (The Julia Language member) commented on May 29, 2015

Can you provide some more details? I saw some issues while launching similar machine configurations with the head node (master) being my laptop, which was on wireless internet, and using ssh tunneling. However, there was no issue when all processes, including the master, were within the AWS network. Are you using the tunnel option? I should have a fix to optimize the tunnel option soon.

---

**hfty** commented on Jun 2, 2015

After some messing around, I believe it's probably due to slower network connectivity between certain instances. In particular, my master was an m3.large instance, which has "Moderate" network performance. With a master with faster connectivity I don't have that issue anymore. I am using the tunnel option, something like this:

```julia
for i in 1:3
    print(addprocs([(addr, 12)]; tunnel=true, sshflags=`-i /home/centos/.ssh/XXX.pem`), "\n")
end
```

I didn't manage to set up connections without the tunneling option. Should that be faster?

---

**amitmurthy** (The Julia Language member) commented on Jun 2, 2015

Yes, tunnel setup is currently a bit slow. I am working on a patch for this here: #11520.

---

**amitmurthy** (The Julia Language member) commented on Jun 2, 2015

Without tunneling, you will need to ensure that your security group allows inbound connections to ports 9009 and greater from the machine running the master. A reasonable config would be to allow inbound 9009 to 9100 from the machine doing the `addprocs`.

---

**hfty** commented on Jun 3, 2015

Thanks, I'm going to try that.

---

**amitmurthy** (The Julia Language member) commented on Jun 10, 2015

Can you test now that #11520 has been merged? It should be fast enough even with `tunnel=true`.

---

**hfty** commented on Jun 14, 2015

Finally got a chance to test. It's much, much faster indeed, with the tunnel and without, even on instances with slower network connectivity. Thank you!

---

*amitmurthy closed this on Jun 14, 2015*
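For reference, a minimal sketch contrasting the two SSH connection modes discussed in this thread; the host string, key path, and worker count are placeholders rather than values from the thread, and `addprocs` is used in its standard SSH form (in Base on 0.3/0.4, in `Distributed` on 0.7+).

```julia
# Placeholder machine spec and ssh key (replace with real values).
addr = "centos@ec2-xx-xx-xx-xx.compute-1.amazonaws.com"
key  = `-i /home/centos/.ssh/XXX.pem`

# 1) Tunneled: worker connections are forwarded over ssh, so no additional
#    inbound ports need to be opened on the worker instances.
addprocs([(addr, 12)]; tunnel=true, sshflags=key)

# 2) Direct: the workers' security group must allow inbound TCP on ports
#    9009 and above (e.g. 9009-9100) from the machine running the master,
#    as described above.
addprocs([(addr, 12)]; sshflags=key)
```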