julia-users › Is there a tutorial on how to set up my own Julia cluster?
11 posts by 6 authors

Ismael Venegas Castelló 9/25/15

Hello everyone!

I am trying to set up a Julia cluster with 20 nodes; this is the very first time I've tried something like this. I have looked around for examples, but the documentation is not very helpful to me:

> Julia can be started in parallel mode with either the -p or the --machinefile options. -p n will launch an additional n worker processes, while --machinefile file will launch a worker for each line in file file. The machines defined in file must be accessible via a passwordless ssh login, with Julia installed at the same location as the current host. Each machine definition takes the form [count*][user@]host[:port] [bind_addr[:port]]. user defaults to current user, port to the standard ssh port. count is the number of workers to spawn on the node, and defaults to 1. The optional bind-to bind_addr[:port] specifies the ip-address and port that other workers should use to connect to this worker.

This is what I think I have understood so far:

OK, I list the machines in a machine file. That's easy; I have a file like this:

    n user@555.555.555.555
    n user@555.555.555.556
    n user@555.555.555.555

> The machines defined in file must be accessible via a passwordless ssh login,

This is the part that is most difficult for me. It says that the machines must be accessible via passwordless ssh, with Julia installed at the same location as on the current host. I understand this to mean that I need to install Julia on every node in the same location, so I have 20 nodes with the same software and hardware stacks. Does this mean that the nodes must run the same operating system? Or only have the same bitness (32/64)?
Right now I have 20 CentOS 6.7 (64-bit) nodes with julia-0.3.11 installed from the generic Linux binaries (64-bit), all of them installed at /opt/julia-0.3.11/bin (added to the PATH and already exported in /etc/profile).

Now the plan in my mind is to use my laptop (Windows 7 64-bit, julia-0.3.11 64-bit) as the master node and control the cluster with that, so according to what I understand, I'll need to run the following (leaving the passphrase blank):

    ssh-keygen -t rsa

from my Windows laptop (I plan to install Arch Linux soon), in order to create my ssh key, and then:

    cat ~/.ssh/id_rsa.pub | ssh user@hostname 'cat >> .ssh/authorized_keys'

for every node? So I have to be running the ssh server on every one of them? (I understand I'll need it at the master node.)

This is where I simply don't understand anymore. I haven't seen any tutorial, article, or anything like that, just that paragraph in the manual. I know there is ClusterManagers.jl, but that sounds even more complicated for me right now.

I also want to help David Sanders set up another cluster (once I get this figured out) in his lab at the Science Faculty, UNAM. I promise to enhance the documentation around this topic once I understand this.

What do you guys think, do I have it all wrong? If anyone can help me, I'll be very grateful. Thanks in advance!

Spencer Russell 9/25/15

Hi Ismael,

So I don’t actually know anything about setting up a Julia cluster specifically, but it sounds like you do indeed need to have an SSH server set up on each machine. That’s actually not very uncommon on Linux boxes, and it’s very possible there’s already one running by default.

One useful utility is `ssh-copy-id user@hostname`, which will add your default public key ($HOME/.ssh/id_rsa.pub) to the authorized_keys list on the remote machine. Make sure to use the same remote machine user that you’ll be using later to log in from your Julia master node.
The nice thing about ssh-copy-id is that it won’t add your key twice if you accidentally run it twice for the same remote machine.

Hope that’s helpful.

-s

Ismael Venegas Castelló 9/26/15

Hi Spencer! I'll check whether those programs are already installed. Thank you very much! Cheers

michae...@gmail.com 9/27/15

A Julia cluster is just a cluster with Julia installed on all the nodes. One way of achieving this is to create a cluster using PelicanHPC, and then do one of:

1) install julia in the /home/user directory, for example, by compiling from source. This directory is NFS shared by all nodes when the cluster is set up.

or

2) run apt-get install julia on all the nodes.

A PHPC cluster is a reasonable solution for a single user. I used to develop it, and used it for a number of years on a 4 node cluster. It's Debian-based.

Ismael Venegas Castelló 10/28/15

How can I start 2 workers on each node, using Julia 0.3.11?

    [count*][user@]host[:port] [bind_addr[:port]]

I have a machine file with only one node (one line). These examples are the forms that work, but they add only one worker per node. I'm using the default port for now and not using a different bind address:

Only host:

    555.555.555.555

User and host:

    root@555.555.555.555

The way I understand

    [count*][user@]host[:port] [bind_addr[:port]]

is that `count` is an integer while `*` means zero or more repetitions in regex notation. At first it seems it doesn't need a space character between the count and the `user@host`, but I have tried several forms and none of them works:

* Use `2` as `count`, separated by a space, with `my_file` being either:

    2 555.555.555.555

or

    2 root@555.555.555.555

    [root@example ~]# julia --machinefile my_file
    ssh: connect to host 2 port 22: Invalid argument

It seems to me it tries to use the 2 as the host address :(

Could anyone please give me an example of a machine file which specifies the worker count?

Thanks in advance, cheers!
Greg Plowman 10/28/15

On v0.3 try multiple entries (lines) in machine file, one for each worker.

Ismael Venegas Castelló 10/28/15

Hello everyone, I have successfully added all the nodes and I can start Julia like this:

    [root@hd0 ~]# julia -p 2 --machinefile Beowulf
                   _
       _       _ _(_)_     |  A fresh approach to technical computing
      (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
       _ _   _| |_  __ _   |  Type "help()" for help.
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 0.3.11 (2015-07-27 06:18 UTC)
     _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
    |__/                   |  x86_64-unknown-linux-gnu

    julia> nprocs()
    22

    julia> nworkers()
    21

    julia>

where the Beowulf file is like this:

    hd1
    hd2
    hd3
    hd4
    hd5
    hd6
    hd7
    hd8
    hd9
    hd10
    hd11
    hd12
    hd13
    hd14
    hd15
    hd16
    hd17
    hd18
    hd19

If I change it to:

    2 hd1
    2 hd2
    2 hd3
    2 hd4
    2 hd5
    2 hd6
    2 hd7
    2 hd8
    2 hd9
    2 hd10
    2 hd11
    2 hd12
    2 hd13
    2 hd14
    2 hd15
    2 hd16
    2 hd17
    2 hd18
    2 hd19

I get the same error I mentioned:

    [root@hd0 ~]# julia -p 2 --machinefile Beowulf2
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ssh: connect to host 2 port 22: Invalid argument
    ^CERROR: interrupt
     in match at ./regex.jl:119
     in parse_connection_info at multi.jl:1090
     in read_worker_host_port at multi.jl:1037
     in read_cb_response at multi.jl:1015
     in start_cluster_workers at multi.jl:1027
     in addprocs_internal at multi.jl:1234
     in addprocs at multi.jl:1244
     in process_options at ./client.jl:240
     in _start at ./client.jl:354
    [root@hd0 ~]#

Seth 10/28/15

On Wednesday, October 28, 2015 at 10:20:00 AM UTC-7, Ismael VC wrote:

> How can I start 2 workers on each node, using Julia 0.3.11?
>
>     [count*][user@]host[:port] [bind_addr[:port]]
>
> The way I understand
>
>     [count*][user@]host[:port] [bind_addr[:port]]
>
> is that `count` is an integer while `*` means zero or more repetitions in regex notation. At first it seems it doesn't need a space character between the count and the `user@host`, but I have tried several forms and none of them works:

I don't think your interpretation is correct. I think the "*" is syntax for "(this many) times". Did you try appending an asterisk after the number? That is, "2* user@host"?

Ismael Venegas Castelló 10/28/15

Thank you very much Greg, that worked! :D

On Wednesday, October 28, 2015, 13:31:53 (UTC-6), Greg Plowman wrote:

> On v0.3 try multiple entries (lines) in machine file, one for each worker.

Ismael Venegas Castelló 10/28/15

Thank you Seth, the count arg is not supported in 0.3.x; I'll update to 0.4.x shortly.

Dan 10/29/15

Perhaps the [count*] notation means: repeat the line count times, i.e.:

    hd1
    hd1
    hd2
    hd2
    :

No time to dig or test yet, so this is just another guess.
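
The workaround that worked on v0.3 (one machine-file line per worker, as Greg suggested) could be automated with a small shell helper. This is only a minimal sketch under stated assumptions: the function name `expand_machinefile` and the output file name `Beowulf.expanded` are hypothetical and do not come from the thread, and the input file is assumed to contain one plain host entry per line, with no count prefix.

```shell
#!/bin/sh
# Hypothetical helper (not part of Julia): print every non-empty line of a
# machine file COUNT times, so that a Julia version without [count*] support
# sees one line per desired worker.
# Usage: expand_machinefile COUNT FILE
expand_machinefile() {
    count="$1"
    infile="$2"
    # "|| [ -n "$host" ]" also handles a last line without a trailing newline.
    while IFS= read -r host || [ -n "$host" ]; do
        # Skip blank lines.
        [ -z "$host" ] && continue
        i=0
        while [ "$i" -lt "$count" ]; do
            printf '%s\n' "$host"
            i=$((i + 1))
        done
    done < "$infile"
}

# Example (assumed file names):
#   expand_machinefile 2 Beowulf > Beowulf.expanded
#   julia --machinefile Beowulf.expanded
```

With the 19-line Beowulf file from the thread and a count of 2, this would write 38 lines (hd1, hd1, hd2, hd2, and so on), which is the layout Dan guessed at.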