julia-users › Optimizing embarrassingly parallel computation with SharedArrays and custom network topology

johnhold... gmail.com, Apr 19

I have an embarrassingly parallel problem with the following features:

- I need to run a pure function on different inputs many times.
- I need to record the result of every call to this function.
- I need to pass a lot of information as input to this function, and in particular each result needs to be matched to the particular input that produced it. But suppose temporarily that all the inputs are fixed except for a single number indexing which computation I'm doing (see example below).

I have successfully run examples of this problem on a single machine and on a multi-machine cluster with passwordless SSH, using three distinct approaches. Apologies for the lack of MWEs here; I will add them if necessary, but the setup should be simple enough and applicable to any example where pure_function below does a very small amount of work.

1. In the single-machine case, use a SharedArray and do

    @sync @parallel for i in 1:N
        my_shared_array[i] = pure_function(i, other_fixed_inputs)
    end

2. In the multi-machine case, I can't use a SharedArray. So I can do essentially the identical thing with a DistributedArray, or
3. essentially the same thing where, instead of using @parallel for, I just use pmap.

(There was also another version where I tried to recreate @parallel for manually by sending remote calls, waiting for workers to finish, and then feeding them the next task, but this didn't outperform the others, and anyway I'd prefer to stick to the realm of canonical, safe approaches.)

When I time these (I'll provide specific benchmarks if necessary) on my local machine ONLY, 1 is way faster than either 2 or 3.
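To make approaches 1 and 3 concrete, here is a minimal sketch rather than my actual code. It assumes Julia 1.x, where @parallel is spelled @distributed and lives in the Distributed standard library. The body of pure_function, the single fixed input offset, the helper names run_shared and run_pmap, and the worker count are all hypothetical stand-ins.

```julia
using Distributed, SharedArrays

# Start a couple of local workers if none exist yet (the count is arbitrary).
nprocs() == 1 && addprocs(2)
@everywhere using SharedArrays

# Hypothetical stand-in for pure_function: cheap, deterministic work.
@everywhere pure_function(i, offset) = i^2 + offset

# Approach 1: workers write their results straight into shared memory.
function run_shared(N, offset)
    out = SharedArray{Int}(N)
    @sync @distributed for i in 1:N
        out[i] = pure_function(i, offset)
    end
    return out
end

# Approach 3: pmap hands tasks to workers and returns results in input order.
run_pmap(N, offset) = pmap(i -> pure_function(i, offset), 1:N)

shared = run_shared(1_000, 3)
mapped = run_pmap(1_000, 3)

# Both approaches should agree element by element.
@assert shared == mapped
```

Wrapping each call in @time (after a warm-up call to exclude compilation) is how I would compare the two on a single machine.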
The slowness of 3 in particular is consistent with the documentation, since pure_function isn't doing much work: I have a relatively high number of calls and a relatively small amount of work done in each call, so @parallel should be a better solution than pmap.

Interlude: I've also tried another approach, say 1A, where I don't use a SharedArray at all but just use vcat as the reduction for @parallel for, as suggested e.g. here. But this is of course slower than all of these approaches when the array is long enough, IIUC because calling reduce with vcat does a lot of unnecessary traversals instead of just smartly aggregating within processes and then pasting together the completed results of the processes. BTW, I've also tried to write a modified @parallel that doesn't try to reduce but just collects the values, but I failed to get it to work.

Benchmarking these results is slightly complicated by my uncertainty over how much of the message-passing overhead is due to passing the inputs to the processes, how much is due to the method of allocating the work to available workers, and how much is due to collecting all the results.

Based on (a) comparing the speed on my local machine of approach 1 to approaches 2 and 3, I believe SharedArrays are ideal for work on a single machine. Based on (b) comparing the speed of 2 and 3 on my local machine only to the speed of 2 and 3 when I have multiple machines, using both the all_to_all and master_slave topologies, it seems that there is a lot of additional overhead in sending information across the network to each of the processes on the remote machines. Note that I don't have independent evidence of this; perhaps Julia is already cleverly deciding in the background that it only needs to communicate across the network once per machine rather than once per process.
But IIUC this is not so, and either way the overhead of communicating with additional processes on remote machines has so far been so high that I get basically no gain from adding them.

Considering these observations, it seems that the ideal approach would look something like this:

1. Set up a custom network topology where the master process has connections only to a single "lieutenant" process on each worker machine, and
2. Per machine, the lieutenant process (including the original master process) creates a SharedArray, adds local processes, and distributes work to all the local-only nodes.

Therefore the initial inputs and the per-machine SharedArrays are the only entities passed across the network (minimizing network communication, which IIUC is the slowest inter-process communication), and then on a given machine as little as possible is passed between processes.

My questions are:

1. Is this really the ideal approach? Or is there a better way?
2. If that really is the ideal approach, how can I do it? I've tried tinkering with the "spawn lieutenant-worker processes on other machines and then make them spawn their own worker processes" idea, but I get errors because the remote lieutenants aren't the master process and don't seem to be allowed to spawn more workers. This suggested to me that I must be thinking about the problem the wrong way, since I probably shouldn't have to modify functions in Base in order to get this to work. So (A) literally how can I do it, and (B) is there a way to do it without digging too deep into the implementation, just solving the problem generically at a high level?
3. Has no one else encountered this particular issue? I've seen so many excellent examples of distributed computing in these forums and elsewhere, yet they all stick either with distributed arrays for multi-machine work or SharedArrays for local work, i.e. thinking either about parallelizing across some local processes or about setting up a network of machines and parallelizing across a bunch of possibly remote processes, but nothing directly tackling the issue of discriminating among worker processes depending on whether they're local or on a remote machine.

Addendum: here is a subset of things I've read on the topic, many of which have proved useful food for thought, despite comment 3 above.

http://docs.julialang.org/en/latest/manual/parallel-computing
https://github.com/JuliaLang/julia/pull/11665
https://github.com/JuliaLang/julia/issues/3655
https://github.com/proflage/2015-julia-hands-on
https://groups.google.com/d/topic/julia-users/1sNXYtIbS1Q/discussion
https://groups.google.com/d/topic/julia-users/E8fGIiDwckc/discussion
https://groups.google.com/d/topic/julia-users/W5QIfE7f0O4/discussion
https://groups.google.com/d/topic/julia-users/WJBIAYzrZgg/discussion
https://groups.google.com/d/topic/julia-users/b4tEzOOOnJI/discussion
https://groups.google.com/d/topic/julia-users/fMTwlMJKNVw/discussion
https://groups.google.com/d/topic/julia-users/fe1yZawvvi0/discussion
https://groups.google.com/d/topic/julia-users/u4tJSCX4jxw/discussion
https://groups.google.com/d/topic/julia-users/vUpyqIxz3Hw/discussion
https://groups.google.com/d/topic/julia-users/vt2hS9h36a0/discussion