Parallel Computing Planning · Issue #9167 (Closed)
alanedelman opened this issue on Nov 26, 2014 · 53 comments · Label: parallel

alanedelman commented on Nov 26, 2014:
As we talk about parallel computing in various contexts, it might be handy to keep in mind all the high-level moving parts. Here are some that come to mind, just to get the ball rolling.

- Hardware
  - Shared-memory machines. Threads: see #1790, #1802
  - Distributed memory, especially networks of shared-memory machines: spawns, MPI?, migrating threads?
  - GPUs? (we do/don't believe these are here to stay)
- Tools
  - Nice graphical performance tools and instrumentation. My first wish: just have green (working) / red (idle) shown for every processor, easily, every time I run parallel Julia.
- Programming models
  - Persistent data, as in global array syntax: DArrays, Star-P, and others
  - Cilk-style spawns
  - Map-reduce, etc.
- Communication layers: ZMQ, MPI, sockets
- Schedulers: ?? Does one master know all state? Can our serial model be in conflict with various parallel models?
- Libraries: ScaLAPACK, parallel FFTW?, sparse matrices, PETSc, Star-P-like; to compare with more pure-Julia methods?

ViralBShah added the parallel label on Nov 26, 2014

tknopp commented on Nov 26, 2014:
Regarding shared-memory multithreading, the threads branch seems to be the closest thing. #6741 was my experiment on multithreading, and that work is superseded by the threads branch.

andreasnoack (The Julia Language member) commented on Nov 26, 2014:
Faster data movement. This is partly about communication layers and partly about libraries.
I'd like to write pure-Julia implementations of large-scale parallel dense linear algebra functions, but it is difficult to get good performance in the present state of Julia's parallel functionality because the communication is slow. The graphs below show timings for moving data to a remote process as a function of array size, for the present @spawn, @spawn with the changes proposed in #6876, and send/Send from MPI.jl.

[figure: data-movement time vs. array size, log scale]

Notice the log scale, which means that right now spawns are ten times slower than MPI for large arrays, and even slower for small arrays, although the variance is very large for small-array spawns. Also relevant: #9046.

elextr commented on Nov 26, 2014:
I'd add one to the hardware section: NUMA-aware memory strategies.

amitmurthy (The Julia Language member) commented on Nov 26, 2014:
addprocs on demand. What I mean by this is that any user should be able to summon thousands of cores, run a parallel algorithm, and pay only for the minutes used. This will be in a separate package, but changes will be required in Base to support it. A possible model could be JuliaLang-backed Docker images that can be run on various cloud infrastructures. The package accepts the user's credentials and manages the launch, tear-down, and billing of the Julia workers used.

eschnett commented on Nov 26, 2014:
I'm slightly confused about how Julia's tasking model will evolve in the presence of multi-threading. I always assumed that Julia's tasks, which are currently executed serially, would be executed in parallel in the future. However, when I look at the threading branch, that may not be the case. Has there been a discussion about this?
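The spawn-and-fetch round trip benchmarked above can be sketched with the Distributed standard library of current Julia. This is an illustrative sketch, not the original benchmark code: in 2014 these functions lived in Base, and the timings here say nothing about the 2014 numbers.

```julia
# Rough sketch: ship arrays of increasing size to a remote process and time
# the round trip. The sum() call is cheap, so the time is mostly data movement.
using Distributed

addprocs(1)                      # one local worker is enough for a rough look
w = first(workers())

for n in (10^3, 10^5, 10^7)
    A = rand(n)
    # A is serialized to worker w; only a scalar comes back
    t = @elapsed fetch(@spawnat w sum(A))
    println("n = ", n, ": ", round(t; digits = 5), " s")
end
```

Absolute numbers depend on the machine; the point is the growth of transfer cost with array size, which is what the plot in the comment above compares across transports.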
tkelman (The Julia Language member) commented on Nov 26, 2014:
@andreasnoack that is excellent data, thanks for collecting it. I'll note that, sadly, none of the open-source MPI implementations can build on non-POSIX platforms right now, but I've got some back-burner plans that involve filling in whatever's missing (possibly involving libuv along the way) and trying to fix them. Almost no one uses Windows for HPC, but it's mildly annoying that I can't run MPI libraries on 8 local cores if I happen to be using Windows. There is a Microsoft MPI implementation, but I doubt many computational MPI libraries work with it. We should still do as much as possible to make Julia play well with MPI, cross-platform or not. To the libraries list I'd like to add https://github.com/elemental/Elemental; @poulson has expressed interest in adding Julia bindings, and it would be good to know if we need to change anything to make that happen.

eschnett commented on Nov 26, 2014:
Microsoft's MPI seems to be built on MPICH, one of the mainstream MPI implementations. (I'm reading the description; I haven't tried it.) MPI implementations tend to be very portable, i.e. very close to the standard, and I would thus expect this implementation to work out of the box with any MPI-using application under Windows.

poulson commented on Nov 26, 2014:
@tkelman I would assume that I can manually expose an IJulia interface in the same way that I did for IPython (see http://web.stanford.edu/poulson/notes/2014/11/11/Elemental-IPython.html). I'm currently wrapping up initial dense and sparse-direct distributed-memory implementations of Mehrotra predictor-corrector interior point methods for LPs (though I am not yet specially handling small numbers of dense rows/columns when forming the normal form of the ideally sparse KKT system). I hope to add distributed dense versions of QP IPMs soon, with distributed dense and sparse-direct SDPs being my ultimate goal.
ViralBShah (The Julia Language member) commented on Nov 26, 2014:
Why do the libraries not work with Microsoft MPI? What have they done?

ViralBShah (The Julia Language member) commented on Nov 26, 2014:
I wonder if much of the time spent is in spawning rather than data movement, at small sizes.

tkelman (The Julia Language member) commented on Nov 26, 2014:
@poulson the IPython approach does look interesting; I did not know about that (see http://ipython.org/ipython-doc/2/parallel/index.html for anyone else who's curious). We should see how much of that also works through IJulia; maybe all of it does? Regarding the linear algebra details and optimization applications, do you have a roadmap issue? I'm hoping you won't limit yourself to convex problems; the nonlinear programming solvers of the world are desperately in need of redistributable parallel sparse linear algebra libraries, of which Elemental might be the first to actually fit the bill. @ViralBShah I'm mostly worried about build systems not knowing how to find and use Microsoft's MPI implementation. It's also not open source, so I can't cross-compile it, so using WinRPM for binaries is out of the picture.

poulson commented on Nov 27, 2014:
@tkelman I apologize for taking this discussion a bit off-topic. But, somewhat in order, my current roadmap is robust sequential/distributed dense/sparse implementations of LPs, QPs, SOCPs, and SDPs which can make use of accelerators on each node. On the subject of parallelization with IJulia: I have little doubt that it would be straightforward.

jakebolewski (The Julia Language member) commented on Nov 27, 2014:
Has the parallel support in IPython been refactored enough to be Python-agnostic? Last time I used it, the implementation was very much tied to using the Python kernel.

poulson commented on Nov 27, 2014:
@jakebolewski I haven't checked. I'm not saying that it should be a black box, but that it is conceptually straightforward.
I would assume that a complete IJulia implementation would duplicate said functionality with Julia workers instead of Python workers. If someone is seriously interested in this, I'm more than happy to discuss further (and possibly consider adding said functionality myself).

amitmurthy (The Julia Language member) commented on Nov 27, 2014:
@poulson regular addprocs should work within an IJulia session too, right? And my understanding is that the workers will be around till the kernel is shut down.

poulson commented on Nov 27, 2014:
@amitmurthy I was honestly not familiar with that function, and it looks like it might. But I wouldn't want to promise anything until I have a working prototype.

poulson commented on Nov 27, 2014:
@ViralBShah MPICH makes use of pthreads, which Windows does not support.

tkelman (The Julia Language member) commented on Nov 27, 2014:
winpthreads works reasonably well, to be fair, and it's a small, permissively licensed library. LLVM might force us off the win32-threads version of MinGW at some point anyway.

ViralBShah (The Julia Language member) commented on Nov 27, 2014:
@amitmurthy's recent work on allowing different transports is worth trying out here on this simple benchmark. @andreasnoack can you create a set of benchmarks in test/perf, and put this benchmark in there? We will certainly want to add to that list, and it would be great if everyone can run these.

andreasnoack (The Julia Language member) commented on Nov 28, 2014:
@ViralBShah I'll try to compile some tests for this.

wildart commented on Dec 3, 2014:
Here is an article by John Langford, "Allreduce (or MPI) vs. Parameter server approaches", where he reviews various parallelization approaches used in machine learning.

andreasnoack referenced this issue on Jan 31, 2015: (open) Speed of data movement in @spawn (#9992)

classner commented on Feb 2, 2015:
Hi everybody!
As Julia is making great progress, it's becoming really interesting for image processing. Local threading is really important here, though. I dug through the various discussions and found this discussion and the threads branch. Apparently, the implementation in that branch is working. The basic tests are passing, though one test relying on the Images module currently doesn't work, since the Images module requires functionality available only in the latest Julia version, and I didn't go through with a merge since it couldn't be done automatically. What are the plans regarding local parallelism? Are there any plans to merge the threads branch? It looks promising!

timholy (The Julia Language member) commented on Feb 2, 2015:
Other than a few experiments back in the very earliest days, I have not been involved in the threads branch at all, so take this with a big grain of salt. I think it's fair to say that there certainly are plans to merge the branch. Guessing here, but presumably what it needs is (1) some free time for its principal developers (including one who needs to defend his thesis) to finish not-yet-implemented features or bug fixes, and (2) people kicking the tires. There's a chance you might be able to help the process along by doing that merge you described above and submitting a PR against the threads branch. If you are brave and willing to start using it in regular work, continued PRs against the threads branch would surely go a long way toward speeding the merger.

StefanKarpinski (The Julia Language member) commented on Feb 2, 2015:
The plan is to merge the threading work into master soon, but leave it off by default since it's still quite experimental. The main purpose of merging is to avoid continually rebasing when people make changes to the internals on master that break the threading functionality. That should make it easier for people to try out threading functionality, even though it will still be experimental during the 0.4 cycle.
classner commented on Feb 2, 2015:
That sounds very good! I had a quick look at what the main merge problems were. A critical part of the changes concerns the garbage collector, which is at the very heart of Julia and cannot be merged without quite some reading knowledge of the internals.

StefanKarpinski (The Julia Language member) commented on Feb 2, 2015:
Yes, it's a pretty non-trivial merge.

ViralBShah (The Julia Language member) commented on Feb 2, 2015:
The GC patch was ready and tested, and since it was merged first, the remaining work is to make it thread-safe. As said before, we expect to have this in 0.4, even if disabled by default. There is a threading branch that is more recent than the threads branch, which is updated to just before the GC merge. Cc: @kpamnany

jakebolewski (The Julia Language member) commented on Feb 2, 2015:
Does the threading branch work on OS X/Windows, or is it Linux-only for the time being?

ViralBShah (The Julia Language member) commented on Feb 2, 2015:
I don't think anyone has tried it on OS X. I wouldn't even dare try Windows. I personally have only tried it on Linux, and I also see a few segfaults.

kpamnany commented on Feb 3, 2015:
Haven't tried OS X or Windows, but the former shouldn't be too hard to get running. The threading infrastructure mostly needs only pthreads and atomics. There's an LLVM dependency, for thread-locals, but @Keno has a functional hack in an llvm-svn branch (I don't know if this is Linux-specific though). Getting this functionality into LLVM is another TODO. Reentrancy in the runtime (specifically in the new GC) is the main blocker right now. And there are, of course, lots of functions in the standard library that need to be made thread-safe, or to have thread-safe versions added. Most crashes you'll find in the threading branch relate to such problems. Lots more functionality to add, but it mostly works, and scales quite nicely. :-) See test/perf/threads/laplace3d.
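The laplace3d perf test kpamnany points to is a threaded stencil loop. A minimal sketch of that shape, written in today's Threads.@threads syntax (the 2015 threads branch exposed a different, experimental API, so this illustrates the idea rather than the branch's code):

```julia
# Jacobi-style 4-point stencil over interior points, parallelized over columns.
# Start Julia with `julia -t N` to run with N threads.
using Base.Threads

n = 256
u = rand(n, n)
unew = zeros(n, n)

@threads for j in 2:n-1
    for i in 2:n-1
        # each thread writes a disjoint set of columns of unew, so no races
        unew[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1])
    end
end

println(extrema(unew[2:n-1, 2:n-1]))   # interior averages stay within [0, 1]
```

Because iterations only read the old array and write disjoint slices of the new one, this loop needs no locks, which is exactly the "simple loop parallelism" described below.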
Keno (The Julia Language member) commented on Feb 3, 2015:
The LLVM patch is Linux-only at this point, but give me a week or two.

classner commented on Feb 3, 2015:
Yes, the scaling is near perfect and the functionality looks very promising and impressive! Do you have plans already for the variable-sharing policy? I.e., which variables will be treated as thread-local, and when are variables synchronized between threads?

kpamnany commented on Feb 3, 2015:
Threads run Julia functions (or blocks that are converted into functions). Block-scoped variables are on the stack, hence thread-local. Variables in an outer scope will be shared. Directive-based privatization a la OpenMP is not high on the priority list (to be honest, it isn't on the list at all, but reductions are). Is this important? Synchronization support will initially be explicit, i.e. locks and atomics. These should eventually be used to build a library of concurrent data structures.

kpamnany commented on Feb 3, 2015:
In response to @eschnett from a while ago (sorry, only saw this when I was mentioned): the interaction between tasks and threads has been the subject of much discussion. Cilk-style parallelism (i.e. fork/join with work stealing) will work best for this, but alternative shared-memory parallel programming models are still some ways out. This is simple loop parallelism, and it isn't clear how tasks can/should interact with this model.

classner commented on Feb 3, 2015:
Thanks for the detailed response! Having a possibility to synchronize, plus the simple loop parallelism, will be perfectly sufficient. I think the OpenMP directives do make sense to simplify data handling on NUMA architectures. At the same time, they're way out of scope for an experimental feature...
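The sharing rules kpamnany describes (outer-scope variables shared, block-scoped variables thread-local, synchronization via explicit atomics) can be illustrated in today's syntax; again, the 2015 branch's API differed, so treat this as a sketch of the semantics, not the branch's code:

```julia
using Base.Threads

total = Atomic{Int}(0)       # outer scope: shared by all threads
@threads for i in 1:1000
    local sq = i^2           # block scope: on the stack, hence thread-local
    atomic_add!(total, sq)   # explicit atomic update, no lock needed
end

println(total[])             # → 333833500, the sum of squares 1..1000
```

Writing to a plain shared Int here instead of an Atomic would be a data race; the atomic is the "explicit synchronization" the comment refers to, alongside locks.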
StefanKarpinski (The Julia Language member) commented on Feb 3, 2015:
Off-topic: @kpamnany, I love your avatar :-)

kpamnany commented on Feb 3, 2015:
@StefanKarpinski, I figured it was the done thing for Julia devs, given yours and Jeff's. :-)

amitmurthy referenced this issue on Feb 18, 2015: (closed) RFC: list of parallel multi.jl enhancements (#3340), 4 of 9 tasks complete

skariel commented on Feb 19, 2015:
Please consider nanomsg; it is by the same author and supersedes ZeroMQ.

pao (The Julia Language member) commented on Feb 19, 2015:
Julia already supports ZMQ (it's needed to communicate with IPython/Jupyter), so it's sensible to pursue. I think the transports are supposed to be pluggable, so if you're interested in a nanomsg transport, you're welcome to build it!

garborg (The Julia Language member) commented on Feb 19, 2015:
@skariel A sensible place to start: https://github.com/quinnj/Nanomsg.jl

ChrisRackauckas commented on Apr 8, 2015:
I was wrong! Good work!

IainNZ (The Julia Language member) commented on Apr 8, 2015:
@ChrisRackauckas do you mean fusing multiple linear algebra operations? Because aren't all the basic linear algebra operations parallel already? As for for loops, we have @parallel already, but soon enough we'll have threads too.

ChrisRackauckas commented on Apr 9, 2015:
Wow, turns out I was just wrong. When I last used Julia I don't remember this, but I just ran a simple test multiplying two random matrices of size 10,000 x 10,000, and it did max out all my cores. I think it should be noted in the documentation which other functions are parallel(?). Also, I did not know about @parallel. FYI, hearing this, and noticing that MATLAB's native parallelization doesn't come close to using the full power of my machine, finally makes me a convert. The only thing I feel like I am missing is easy CUDA support, i.e.
as simple as "send the array over to the GPU and now all matrix operations and standard functions are performed on the GPU", but that would be nice rather than truly necessary.

IainNZ (The Julia Language member) commented on Apr 9, 2015:
Not sure which operations exactly, but basically linear algebra on Float64s for sure. Definitely check out the manual; it covers the parallel computing functionality of Julia pretty well. PRs welcome for improvements as well. CUDA support isn't a base Julia issue IMO, given it's dependent on an external closed-source library, but there is JuliaGPU/CUBLAS.jl which is getting there for higher-level abstractions (also contributions welcome there, I'm sure):

```julia
# elty, m, n, and alpha are set by the surrounding test file
A = rand(elty, m, n)
d_A = CudaArray(A)
# test y = alpha * A * x
x = rand(elty, n)
d_x = CudaArray(x)
y1 = alpha * A * x
y2 = A * x
d_y1 = CUBLAS.gemv('N', alpha, d_A, d_x)
d_y2 = CUBLAS.gemv('N', d_A, d_x)
h_y1 = to_host(d_y1)
h_y2 = to_host(d_y2)
@test_approx_eq y1 h_y1
@test_approx_eq y2 h_y2
```

ChrisRackauckas commented on Apr 9, 2015:
The only mention I could find of that functionality is hinted at in the peakflops function on this page: http://julia.readthedocs.org/en/latest/stdlib/linalg/. There is no mention on the parallel computing page (http://julia.readthedocs.org/en/latest/manual/parallel-computing/) or the linear algebra page (http://julia.readthedocs.org/en/latest/manual/linear-algebra/).

johnmyleswhite (The Julia Language member) commented on Apr 9, 2015:
Arguably this isn't something that should be documented as Julia's behavior, since it's really the behavior of OpenBLAS.

StefanKarpinski (The Julia Language member) commented on Apr 9, 2015:
@ChrisRackauckas, where did the impression that Julia was not maxing out a system's cores when doing things like matmul come from? Doing so is a matter of using a well-tuned multithreaded BLAS, which we always have, albeit a different one from MATLAB (OpenBLAS vs. MKL).
There are other vectorized operations where MATLAB has hand-coded implementations in C that use multiple cores, while Julia's implementation, being written in Julia, uses a single core. Often that deficit can be compensated for by devectorizing the code and improving the serial performance. In the future, Julia will support threading natively, and that issue will go away too. The distributed computing issue is largely orthogonal to all of this, but it is also important if one wants to do HPC-style work.

ViralBShah (The Julia Language member) commented on Apr 9, 2015:
Once we have multi-threaded Julia, it won't be only hand-coded internal functions in C that are multi-threaded, but the entire Julia Base library. So while we have some distance to cover, we will have something great in the 0.5 timeframe.

ChrisRackauckas commented on Apr 9, 2015:
Thank you for the explanation; that clears it up. I look forward to multi-threaded Julia.

ViralBShah (The Julia Language member) commented on Aug 7, 2015:
This planning issue seems too broad. Does it break into smaller tasks that can have their own issues? I am tempted to close this and have specific issues dealing with specific problems.

jeff-regier referenced this issue in jeff-regier/Celeste.jl on Sep 1, 2015: (merged) Parallelize elbo (#55)

ViralBShah (The Julia Language member) commented on Sep 6, 2015:
I am closing this as very generic. A bunch of specific things are being done already.

ViralBShah closed this on Sep 6, 2015

classner commented on Sep 8, 2015:
Where will the status of the threading branch be tracked? And what is its current status? That one was very promising! :)

ViralBShah (The Julia Language member) commented on Sep 8, 2015:
#1790
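The OpenBLAS point made in the last few comments is easy to check for oneself: the parallelism of dense matmul comes from the multithreaded BLAS backing `*`, and the BLAS thread count is adjustable at runtime. A sketch using today's LinearAlgebra stdlib (in 0.3-era Julia the same function lived in Base.LinAlg):

```julia
# Compare dense matmul with 1 BLAS thread vs. all hardware threads.
using LinearAlgebra

A = rand(1000, 1000)
B = rand(1000, 1000)

BLAS.set_num_threads(1)
t1 = @elapsed A * B

BLAS.set_num_threads(Sys.CPU_THREADS)
tN = @elapsed A * B

println("1 BLAS thread: ", round(t1; digits = 3), " s; ",
        Sys.CPU_THREADS, " threads: ", round(tN; digits = 3), " s")
```

On a multi-core machine the second timing is typically several times smaller, which is the "maxing out all my cores" behavior observed above, and it happens inside OpenBLAS rather than in Julia code.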