machinefile ssh improvements #7589

Closed. wavexx opened this issue on Jul 13, 2014 · 9 comments
Labels: parallel

wavexx commented on Jul 13, 2014

I was trying Julia recently, but I'm having a somewhat hard time actually using all the cores I have access to, due to the way the workers are spawned.

First: is it somehow possible to enhance the machine file to have a host[:count] format to specify the worker count, instead of repeating the same host 64 times? My current machinefile is a slew of repeated lines. It's actually hard to edit.

Second: connecting several times to the same host just to create n channels is a bit overkill, especially if we're talking ssh. In fact, I cannot have more than 20 workers per host in my case. I was able to increase the count by using the connection multiplexing in ssh itself, by manually configuring ControlMaster/ControlPath, but I would definitely suggest adding:

-o ControlMaster=auto -o ControlPath=path -o ControlPersist=5

when connecting via ssh, where path should be a temporary path unique to the current Julia master process, in order to avoid ControlMaster sharing among different master processes. This saves at least N-1 processes on the remote end (besides eliminating the connection handshakes). ControlPersist could also be removed if the ControlMaster is managed by Julia, as opposed to ssh's auto feature.

I would also strongly recommend ssh -T in any case, to avoid a tty allocation (we don't need one anyway), since this is another issue when requesting a large number of connections via ssh.

So far this should be rather trivial to do, but it's still not optimal.
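The ssh options suggested above could be assembled roughly as follows. This is a minimal Python sketch, not Julia's actual launcher code: the function name `ssh_worker_command`, its parameters, and the `julia --worker` remote command in the usage line are illustrative assumptions. The `%r`, `%h` and `%p` tokens in the control path are expanded by ssh itself to the remote user, host and port.

```python
import os
import tempfile


def ssh_worker_command(host, remote_cmd, control_dir=None, persist_seconds=5):
    """Build an ssh argv that reuses one multiplexed connection per host.

    Hypothetical helper for illustration only.
    """
    if control_dir is None:
        # A fresh private directory keeps the control sockets unique to
        # this master process, so separate masters never share them.
        control_dir = tempfile.mkdtemp(prefix="ssh-mux-")
    control_path = os.path.join(control_dir, "%r@%h:%p")
    return [
        "ssh",
        "-T",  # do not allocate a tty; the worker does not need one
        "-o", "ControlMaster=auto",
        "-o", "ControlPath=" + control_path,
        "-o", "ControlPersist=" + str(persist_seconds),
        host,
        remote_cmd,
    ]


# Example: argv for one worker on a host called "node1" (placeholder name).
argv = ssh_worker_command("node1", "julia --worker")
```

Every call that passes the same `control_dir` and host yields the same ControlPath, so ssh can route later sessions through the first connection instead of performing a new handshake.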
It's quite obvious that running N instances of the same process on the same machine should be done by executing a single copy of julia --worker -p n, then forking n times just after bootstrap to share the initialization memory setup and use the same communication channel. Given the current non-negligible startup time of Julia, it would make a big difference.

StefanKarpinski (The Julia Language member) commented on Jul 13, 2014

host:port is used to specify the SSH port number, so that syntax proposal clashes. You could do host[:port] x count or something like that. The rest of these suggestions all seem like good ideas.

JeffBezanson added performance, parallel and removed performance labels on Jul 13, 2014

JeffBezanson (The Julia Language member) commented on Jul 13, 2014

Yes, these are great suggestions. Thanks.

wavexx commented on Jul 14, 2014

x count is a bit weird, as x would require whitespace without a port. What about host[count] and host:port[count], which mimics array syntax?

This was referenced on Jul 14, 2014:
Closed: make julia -p N use fork instead of exec #985
Merged: Add `-T` `-a` to the default ssh command(s). #7599

StefanKarpinski (The Julia Language member) commented on Jul 14, 2014

In Julia that's how you index into an array, not how you declare its size, so that seems weird. This would be an option: 16 host:1234 or host:1234 16. I don't really see why whitespace is a problem.

wavexx commented on Jul 15, 2014

The current syntax is already defined to be:

[user@]host[:port] [bind_addr]

so it's ambiguous for bind_addr (though arguably you could check if it's an integer). What about:

[user@]host[:port] count [bind_addr]

wavexx referenced this issue on Jul 15, 2014
Merged: Support host count in machinefile #7616

wavexx commented on Jul 15, 2014

I added initial support for count in the above pull request. This is just a stop-gap commit. My plan is to add a :n (:count?) argument to Base.addprocs(machines, ...) later, so that I can properly create a shared ssh channel for multiple workers on the same host using ControlPath/ControlMaster, as described above.

amitmurthy referenced this issue on Sep 11, 2014
Merged: RFC: reworked cluster manager interface #8306

sbromberger commented on Dec 13, 2014

Would it also be possible to specify the julia bin directory for the remote machines? I frequently have test builds in various directories on my local machine, but on my remotes they're in different directories (usually /usr/bin). This is analogous to the dir kwarg for addprocs.

Edit: I've created a PR for this: #9347.

This was referenced on Dec 13, 2014:
Closed: added dirs option to machinefile parsing #9347
Open: Machinefile nonuniform install locations #10474

kshyatt commented on Sep 14

What's the status on these suggestions? cc @amitmurthy

amitmurthy (The Julia Language member) commented on Sep 14

When specified with a count, we now launch only one process on a remote node, and then launch additional workers via that initial instance. So for a directly accessible cluster there is only one ssh connection per node.

Closing the issue. Please reopen if there are any other ssh specific suggestions.

amitmurthy closed this on Sep 14
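The per-host count discussed in this thread amounts to grouping identical machinefile lines. The Python sketch below illustrates only that grouping step, under the assumption of the old one-line-per-worker layout; the function name `collapse_machinefile` and the host names are hypothetical, and this is not Julia's own machinefile parser.

```python
from collections import Counter


def collapse_machinefile(lines):
    """Return (host_spec, count) pairs, one per distinct line, in first-seen order."""
    counts = Counter()
    for line in lines:
        spec = line.strip()
        if spec:  # ignore blank lines
            counts[spec] += 1
    # Counter is a dict subclass, so items() keeps insertion order (Python 3.7+).
    return list(counts.items())


# A machinefile that repeats each host once per desired worker.
repeated = ["node1", "node1", "node1", "node2"]
for spec, count in collapse_machinefile(repeated):
    # One connection per host would then be asked to start `count` workers.
    print(spec, count)
```

With the hosts grouped this way, a launcher needs only one ssh connection per distinct host, which matches the behaviour described in the closing comment.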