Problem with addprocs on Amazon EC2 #14768
Closed. lruthotto opened this issue on Jan 22 · 4 comments
Labels: parallel
Participants: @lruthotto @ViralBShah @tanmaykm @amitmurthy @kshyatt

@lruthotto commented on Jan 22

I have problems establishing a connection between instances of the same AMI on Amazon EC2. I roughly followed @mlubin's tutorial here: http://julialang.org/blog/2013/04/distributed-numerical-optimization/ to set up my workers, and changed the file /etc/ssh/ssh_config to turn off StrictHostKeyChecking and to provide the ssh key. After this I can ssh between the workers with no problem (outside julia) by typing ssh xxx.xx.xx.xx6 (no password prompt or key file needed).

Starting julia on worker xxx.xx.xx.xx7, I see weird behavior:

    Julia Version 0.4.4-pre+2 (2016-01-18 02:17 UTC)
    Commit a85c3a0* (4 days old release-0.4)
    x86_64-linux-gnu

    julia> addprocs(["ubuntu@xxx.xx.xx.xx6"],tunnel=true)
    1-element Array{Int64,1}:
     2

    julia> addprocs(["ubuntu@xxx.xx.xx.xx6"],tunnel=true)
    1-element Array{Int64,1}:
     3

    julia> addprocs(["ubuntu@xxx.xx.xx.xx7"],tunnel=true)

The first two commands work and two remote workers are added. The last command stalls and eventually times out. I tried this in many different variants. For example, reversing the order gives the same problem (first adding a worker on xxx.xx.xx.xx7 is okay, but then adding one on xxx.xx.xx.xx6 fails).

Can anybody please explain to me what's going on?
@kshyatt added the parallel label on Jan 22

@ViralBShah (The Julia Language member) commented on Jan 22

Cc @amitmurthy @tanmaykm

@tanmaykm (The Julia Language member) commented on Jan 22

It could possibly be due to connections timing out. Could you try setting JULIA_WORKER_TIMEOUT to a higher value? See: http://docs.julialang.org/en/latest/stdlib/parallel/#Base.addprocs

If that's the case, putting all EC2 instances in the same placement group may help.

@amitmurthy (The Julia Language member) commented on Jan 23

It is assumed that all the workers are on a directly connected open network, i.e. they need to be able to connect with each other on all ports. The tunnel option only applies between the master and the workers.

A couple of options:

1) Put all the worker nodes in a security group that allows all traffic between nodes of the same security group.
2) In case the computation does not need worker-worker messaging, using topology=:master_slave will allow the cluster to come up. See http://docs.julialang.org/en/latest/manual/parallel-computing/#specifying-network-topology-experimental

@lruthotto commented on Jan 23

Thanks a lot for your help.

@tanmaykm: Changing the timeout did not help me. I also think my instances were pretty close to each other. Using ssh from the terminal, at least, was possible between all of them, and it did not take much time to connect.

What really helped was allowing all traffic between the nodes when launching the instances. Before, I allowed only SSH traffic, assuming this is what julia needs. I was not sure whether allowing All Traffic is in general a good idea from a security point of view, but I found that you can restrict it to IP ranges.

@amitmurthy: The :master_slave topology sounds interesting. In fact, that's exactly what I need for my problem. I will test that later and see if it improves the scalability of my code.

Thanks again!

@lruthotto closed this on Jan 23
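
For later readers, the second option from @amitmurthy's comment might look roughly like the sketch below. It assumes Julia 0.4, the version in the original report, and reuses the placeholder addresses from the thread. The `machines` and `pids` names and the `pmap` call are illustrative and not from the original discussion. On Julia 1.x the same keyword should be spelled `:master_worker` and needs `using Distributed` first; check the `addprocs` documentation for your version.

```julia
# Sketch for Julia 0.4 (assumption: same version as in the report above).
# On Julia 1.x, run `using Distributed` first and use topology=:master_worker.

# Placeholder addresses copied from the thread; replace with real hosts.
machines = ["ubuntu@xxx.xx.xx.xx6", "ubuntu@xxx.xx.xx.xx7"]

# tunnel=true:            the master reaches each worker through an SSH tunnel.
# topology=:master_slave: workers connect only to the master, so no
#                         worker-to-worker ports have to be open.
pids = addprocs(machines; tunnel=true, topology=:master_slave)

# Work dispatched from the master still runs on the workers (illustrative call).
results = pmap(x -> x^2, 1:4)
```

With this topology, code that sends messages directly from one worker to another will not work, so it only fits computations where the master hands out tasks and collects results.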