Problem with addprocs on Amazon EC2 #14768
Closed
lruthotto opened this Issue on Jan 22 · 4 comments
@lruthotto
lruthotto commented on Jan 22

I have problems establishing a connection between instances of the same AMI on Amazon EC2.

I roughly followed @mlubin's tutorial here: http://julialang.org/blog/2013/04/distributed-numerical-optimization/ to set up my workers, and changed the file /etc/ssh/ssh_config to turn off StrictHostKeyChecking and to provide the ssh key.

After this I can ssh between the workers with no problem (outside julia) by typing

ssh xxx.xx.xx.xx6

(no password prompt or key file). Starting julia on worker xxx.xx.xx.xx7, I see strange behavior:

   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.4-pre+2 (2016-01-18 02:17 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit a85c3a0* (4 days old release-0.4)
|__/                   |  x86_64-linux-gnu

julia> addprocs(["ubuntu@xxx.xx.xx.xx6"],tunnel=true)
1-element Array{Int64,1}:
 2
julia> addprocs(["ubuntu@xxx.xx.xx.xx6"],tunnel=true)
1-element Array{Int64,1}:
 3

julia> addprocs(["ubuntu@xxx.xx.xx.xx7"],tunnel=true)

The first two commands work and two remote workers are added. The last command stalls forever and eventually times out. I tried this in many different variants. For example, reversing the order gives the same problem (first adding a worker on xxx.xx.xx.xx7 is okay, but then adding one on xxx.xx.xx.xx6 fails).

Can anybody please explain to me what's going on?
@kshyatt kshyatt added the parallel label on Jan 22
@ViralBShah
The Julia Language member
ViralBShah commented on Jan 22

Cc @amitmurthy @tanmaykm
@tanmaykm
The Julia Language member
tanmaykm commented on Jan 22

It could possibly be due to connections timing out. Could you try setting JULIA_WORKER_TIMEOUT to a higher value? See: http://docs.julialang.org/en/latest/stdlib/parallel/#Base.addprocs

If that's the case, putting all EC2 instances in the same placement group may help.
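
For reference, a minimal sketch of raising the timeout from within the master session before adding workers. The value 120 is an arbitrary example, not a recommendation. This only sets the variable in the master process; the assumption here is that a worker started over ssh would need the variable in its own shell environment on the remote host to pick it up.

```julia
# Sketch for Julia 0.4, where addprocs lives in Base.
# (Julia 0.7 and later need `using Distributed` first.)

# Timeout in seconds; 120 is an arbitrary illustrative value.
# Must be set before addprocs is called.
ENV["JULIA_WORKER_TIMEOUT"] = "120"

# Same call as in the original report.
addprocs(["ubuntu@xxx.xx.xx.xx6"], tunnel=true)
```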
@amitmurthy
The Julia Language member
amitmurthy commented on Jan 23

It is assumed that all the workers are on a directly connected open network, i.e. they need to be able to connect with each other on all ports. The tunnel option is only between master and the workers.

A couple of options:

1) Put all the worker nodes in a security group that allows all traffic between nodes of the same security group.

2) In case the computation does not need worker-worker messaging, using topology=:master_slave will allow the cluster to come up. See http://docs.julialang.org/en/latest/manual/parallel-computing/#specifying-network-topology-experimental
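
A sketch of option 2, reusing the host names from the original report. With this topology only the master connects to each worker, so the workers never try to reach each other directly. It assumes the computation needs no worker-worker messaging; the option is marked experimental in the linked docs, and later Julia releases document it as :master_worker.

```julia
# Sketch for Julia 0.4, where addprocs lives in Base.
# (Julia 0.7 and later need `using Distributed` first.)

# Add both remote workers in one call. With :master_slave, only
# master<->worker connections are set up, and tunnel=true routes
# those connections over ssh.
addprocs(["ubuntu@xxx.xx.xx.xx6", "ubuntu@xxx.xx.xx.xx7"],
         tunnel=true, topology=:master_slave)

# List the worker ids that came up.
println(workers())
```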
@lruthotto
lruthotto commented on Jan 23

Thanks a lot for your help.

@tanmaykm: Changing the timeout did not help me. I also think my instances were pretty close to each other. At least, using ssh from the terminal was possible between all of them, and it did not take much time to connect.

What really helped was allowing all traffic between the nodes when launching the instances. Before, I had allowed only SSH traffic, assuming this was what Julia needs. I was not sure whether allowing All Traffic is a good idea in general from a security point of view, but I found that you can restrict it to IP ranges.

@amitmurthy: The :master_slave topology sounds interesting. In fact that's exactly what I need for my problem. I will test that later and see if it improves the scalability of my code.

Thanks again!
@lruthotto lruthotto closed this on Jan 23