julia-users › Performance tips for network data transfer?
4 posts by 3 authors

Matthew Pearce (Aug 12)

Dear Julians,

I'm trying to speed up the network data transfer in an MCMC algorithm. Currently I'm using `remotecall`-based functions to do the network communication. Essentially, for every node I include, I incur about 50 MB of data transfer per iteration. The topology is that various 1 MB vectors get computed on worker nodes and transferred back to a central node. The central node does some work on each vector and sends back a copy of the resulting vector (same size) to each worker node.

I'm doing the send and receive transfers asynchronously, but it's scaling quite badly, because the network transfer complexity is O(nodes × vectors) and the constants are big. This makes me think there's some duplicated work going on, such as the same vector being serialized on the central node for each transfer to another node.

Is there a way to only incur serialization preparation costs once on the central worker, when the same data is transferred to multiple workers?

Is it likely to help if I write branching code (1 sends to 2; 1 and 2 send to 3 and 4; 1, 2, 3, 4 send to 5, 6, 7, 8)?

Alternatively, is there any way of using other, faster technologies from within the REPL? My cluster supports MPI, and I also have GPUs with InfiniBand connections.

My appetite for messing around with this to achieve better performance is quite high.

Cheers in advance,
Matthew

Amit Murthy (Aug 12)

Are the constants the same across iterations? If so, you may find a CachingPool useful: http://docs.julialang.org/en/latest/stdlib/parallel/#Base.CachingPool

> Is there a way to only incur serialization preparation costs once on the central worker, when the same data is transferred to multiple workers?

Time how long it takes to serialize your data to an IOBuffer. This will give you an idea of whether there is any benefit to serializing first to an IOBuffer and then sending that once to every worker. We don't yet have a network-optimized construct like an MPI broadcast (`@everywhere` does serialize separately to each worker).

> Is it likely to help if I write branching code (1 sends to 2; 1 and 2 send to 3 and 4; 1, 2, 3, 4 send to 5, 6, 7, 8)?

How many workers do you have? I doubt you will see much benefit for up to a couple of hundred workers at the sizes you mention.

> Alternatively, is there any way of using other, faster technologies from within the REPL? My cluster supports MPI, and I also have GPUs with InfiniBand connections.

The MPI.jl package supports using MPI as the transport. There is some more work required to optimize the use of the MPI transport and to support MPI broadcast. This is WIP.

> My appetite for messing around with this to achieve better performance is quite high.

Good to hear. If you have narrowed down specific performance issues, they can be worked on in the MPI.jl / Julia repos.

Jared Crean (Aug 12)

The MPI.jl package also supports calling MPI routines directly. If you are transferring arrays of immutables, they can be sent with no overhead (serialization or otherwise). The limitation is that mutable types cannot be sent (though they can be using the `remotecall` framework).

Jared Crean

Matthew Pearce (Aug 17)

Thanks for the thoughts, people; much appreciated, gives me some ideas to work with. I'm going to play around with pure-Julia solutions first, as my prior experience trying to get MPI.jl running on my cluster in a REPL was painful. This could be the wrong attitude and I may have to change it.
Workers will be in the low tens as I only need one per compute node.
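A minimal sketch of the serialize-once approach Amit describes, written with the post-1.0 stdlib names (`Distributed`, `Serialization`); in the Julia of this thread these functions lived in Base. The names `bcast_serialized`, `receive_bytes`, and `RECEIVED` are illustrative, not from the thread. As Amit suggests, time the `serialize` step first: for flat arrays of bitstypes it is already close to a block copy, so the win shows up mainly for more complex objects.

```julia
using Distributed, Serialization

addprocs(4)

@everywhere begin
    using Serialization
    const RECEIVED = Ref{Any}(nothing)   # worker-side slot for the payload
    # Rebuild the object from the raw bytes shipped by the master.
    receive_bytes(b::Vector{UInt8}) = (RECEIVED[] = deserialize(IOBuffer(b)); nothing)
end

# Serialize `x` once on the master, then ship the same byte blob to
# every worker; a Vector{UInt8} serializes as a plain block copy.
function bcast_serialized(x, pids = workers())
    buf = IOBuffer()
    serialize(buf, x)                    # pay the serialization cost once
    bytes = take!(buf)
    @sync for p in pids
        @async remotecall_wait(receive_bytes, p, bytes)
    end
end

bcast_serialized(rand(125_000))          # ~1 MB of Float64s
```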
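A sketch of the `CachingPool` route for constants that are reused across iterations, again with post-1.0 names; `state` and `f` are hypothetical stand-ins. The data must be captured by the closure (e.g. via `let`), not referenced as a global, for it to travel with the function:

```julia
using Distributed

addprocs(4)

state = rand(125_000)            # ~1 MB of constants reused across calls

# Capture `state` by value so it is serialized together with the closure.
f = let s = state
    x -> sum(s) * x
end

wp = CachingPool(workers())

# The first call ships `f` (and the captured `state`) to each worker;
# later calls over the same pool reuse the cached copy.
r1 = pmap(f, wp, 1:10)
r2 = pmap(f, wp, 1:10)           # no large transfer this time

clear!(wp)                       # release the cached closures when done
```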
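Finally, a sketch of Jared's point about sending arrays of immutables through MPI.jl with no serialization, assuming a current MPI.jl API (`MPI.Bcast!` with a positional root argument). It runs as a script under `mpiexec` rather than from the REPL, which is exactly the friction Matthew mentions:

```julia
# bcast.jl: launch with `mpiexec -n 4 julia bcast.jl`
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Arrays with isbits elements go over the wire as raw bytes.
A = rank == 0 ? rand(125_000) : zeros(125_000)   # ~1 MB of Float64s
MPI.Bcast!(A, 0, comm)           # rank 0 broadcasts to all ranks

println("rank $rank: sum(A) = $(sum(A))")
MPI.Finalize()
```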