Problem with addprocs on Amazon EC2 #14768

Closed. lruthotto opened this issue on Jan 22 · 4 comments
Labels: parallel

lruthotto commented on Jan 22

I have problems establishing a connection between instances of the same AMI on Amazon EC2. I roughly followed mlubin's tutorial here: http://julialang.org/blog/2013/04/distributed-numerical-optimization to set up my workers, and changed the file /etc/ssh/ssh_config to turn off StrictHostKeyChecking and to provide the ssh key. After this I can ssh between the workers with no problem outside julia by typing ssh xxx.xx.xx.xx6 (no password prompt or key file).

Starting julia on worker xxx.xx.xx.xx7, I see a weird behavior:

                   _
       _       _ _(_)_     |  A fresh approach to technical computing
      (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
       _ _   _| |_  __ _   |  Type "?help" for help.
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 0.4.4-pre+2 (2016-01-18 02:17 UTC)
     _/ |\__'_|_|_|\__'_|  |  Commit a85c3a0 (4 days old release-0.4)
    |__/                   |  x86_64-linux-gnu

    julia> addprocs(["ubuntu@xxx.xx.xx.xx6"],tunnel=true)
    1-element Array{Int64,1}:
     2

    julia> addprocs(["ubuntu@xxx.xx.xx.xx6"],tunnel=true)
    1-element Array{Int64,1}:
     3

    julia> addprocs(["ubuntu@xxx.xx.xx.xx7"],tunnel=true)

The first two commands work and two remote workers are added. The last command stalls and eventually times out. I tried this in many different variants. For example, reversing the order gives the same problem (first adding a worker on xxx.xx.xx.xx7 is okay, but then adding one on xxx.xx.xx.xx6 fails).

Can anybody please explain to me what's going on?

kshyatt added the parallel label on Jan 22

ViralBShah (The Julia Language member) commented on Jan 22

Cc @amitmurthy @tanmaykm

tanmaykm (The Julia Language member) commented on Jan 22

It could possibly be due to connections timing out.
Could you try setting JULIA_WORKER_TIMEOUT to a higher value? See: http://docs.julialang.org/en/latest/stdlib/parallel/#Base.addprocs
If that's the case, putting all EC2 instances in the same placement group may help.

amitmurthy (The Julia Language member) commented on Jan 23

It is assumed that all the workers are on a directly connected open network, i.e. they need to be able to connect with each other on all ports. The tunnel option applies only between the master and the workers.

A couple of options:

1. Put all the worker nodes in a security group that allows all traffic between nodes of the same security group.
2. In case the computation does not need worker-worker messaging, using topology=:master_slave will allow the cluster to come up. See http://docs.julialang.org/en/latest/manual/parallel-computing/#specifying-network-topology-experimental

lruthotto commented on Jan 23

Thanks a lot for your help.

@tanmaykm: Changing the timeout did not help me. I also think my instances were pretty close to each other. Using ssh from the terminal, at least, was possible between all of them, and it did not take much time to connect.

What really helped was allowing all traffic between the nodes when launching the instances. Before, I allowed only SSH traffic, assuming this is what julia needs. I was not sure whether allowing All Traffic is in general a good idea from a security point of view, but I found that you can restrict it to IP ranges.

@amitmurthy: The :master_slave topology sounds interesting. In fact, that's exactly what I need for my problem. I will test that later and see if it improves the scalability of my code.

Thanks again!

lruthotto closed this on Jan 23
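
The second option suggested by amitmurthy can be tried out without any EC2 setup. The following is a minimal sketch, not the configuration used in this thread. It assumes a recent Julia release, where the parallel functions live in the Distributed standard library and the keyword value is spelled :master_worker; on the 0.4 release shown above no import is needed and the value is :master_slave. It starts two local workers instead of remote ones, and the worker count and the use of myid are illustrative choices only.

```julia
using Distributed  # assumption: recent Julia; not needed on 0.4

# Start two local workers that connect only to the master (pid 1),
# not to each other. On Julia 0.4 the value would be :master_slave.
pids = addprocs(2; topology=:master_worker)

# Master-to-worker calls work as usual: ask each worker for its own id.
results = [remotecall_fetch(myid, p) for p in pids]

# Shut the workers down again.
rmprocs(pids)
```

For the remote hosts in this thread, the corresponding 0.4-style call would presumably be addprocs(["ubuntu@xxx.xx.xx.xx6"]; tunnel=true, topology=:master_slave). With this topology the workers never open connections to each other, so only the SSH connections from the master are needed. The trade-off is that code sending messages directly between workers will fail.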