julia-users › Embarrassingly parallel workload
16 posts by 8 authors

Júlio Hoffimann 8/9/15

Hi,

Suppose I have a complicated but embarrassingly parallel loop, namely:

https://github.com/juliohm/ImageQuilting.jl/blob/master/src/iqsim.jl#L167

How would you dispatch the iterations so that all cores in the client computer are busy working? Is there any construct in the language for this usual scenario?

-Júlio

Tim Holy 8/9/15

Completely possible with SharedArrays, see
http://docs.julialang.org/en/latest/manual/parallel-computing/#shared-arrays

--Tim

Júlio Hoffimann 8/9/15

Thank you Tim, will check it carefully.

-Júlio

Júlio Hoffimann 8/9/15

Consider the following simplified example. There is an algorithm implemented as a function foo(N). This algorithm repeats the same recipe N times in a loop to fill in an array of arrays:

function foo(N)
  # bunch of auxiliary variables goes here
  # ...
  result = []
  for i=1:N
    # complicated use of auxiliary variables to produce result[i]
    # ...
    push!(result, rand(3,3)) # push result[i] to the vector
  end
  return result
end

Each iteration is expensive, and we would like to add an option foo(N; parallel=false) to use all available cores. I would start with:

function foo(N; parallel=false)
  if parallel
    addprocs(CPU_CORES - 1)
  end
  # bunch of auxiliary variables goes here
  # ...
  result = SharedVector(???)
  @parallel for i=1:N
    # complicated use of auxiliary variables to produce result[i]
    # ...
    push!(result, rand(3,3))
  end
  return result
end

How do I initialize the SharedVector? Is this the correct approach? What about the auxiliary variables, will they be copied to all workers? If yes, can I just replace Arrays by SharedArrays so that they are visible everywhere without copying?

-Júlio

Júlio Hoffimann 8/10/15

What am I doing wrong in the following code?

function foo(N; parallel=false)
  if parallel && nprocs() < CPU_CORES
    addprocs(CPU_CORES - nprocs())
  end

  result = SharedArray(Float64, 9, N)
  @parallel for i=1:N
    sleep(1)
    result[:,i] = rand(3,3)[:]
  end

  result
end

If I call foo(60, parallel=true), result is all zeros. The expected behavior is a random matrix.

-Júlio

Kristoffer Carlsson 8/10/15

Something like this?

@everywhere function fill(A::SharedArray)
    for idx in Base.localindexes(A)
        A[idx] = rand()
    end
end

function fill_array(m, n)
    A = SharedArray(Float64, (m, n))
    @sync begin
        for p in procs(A)
            @async remotecall_wait(p, fill, A)
        end
    end
    A
end

fill_array(9, 60)

Júlio Hoffimann 8/19/15

Hi Kristoffer, sorry for the delay and thanks for the code.

What I want to do is very simple: I have an expensive loop for i=1:N such that each iteration is independent and produces a large array of size M. The result of this loop is a matrix of size MxN. I have many CPU cores at my disposal and want to distribute this work.

In the past I accomplished that with MPI in Python: https://github.com/juliohm/HUM/blob/master/pyhum/utils.py

Whenever a process in the pool is free, it "consumes" an iteration of the loop. What exactly is the @parallel macro in Julia doing? How can I modify the code I previously posted in this thread to achieve that effect?

-Júlio

Ismael Venegas Castelló 8/19/15

There is an MPI wrapper for Julia, I don't know if it will suit your needs though:

https://github.com/JuliaParallel/MPI.jl

Júlio Hoffimann 8/19/15

Hi Ismael,

MPI is distributed memory; I'm trying to use all the cores in my single workstation with shared memory instead. Thanks for the link anyway.

-Júlio
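The worker-pool behavior Júlio describes above, where each free process consumes the next iteration, is what pmap provides: @parallel splits the range statically across the available workers, while pmap hands out one task at a time, which suits expensive and uneven iterations. A minimal sketch in Julia 0.4-era syntax follows; expensive_iteration and the 9xN result shape are illustrative stand-ins, not code from the thread.

nprocs() < CPU_CORES && addprocs(CPU_CORES - nprocs())

@everywhere function expensive_iteration(i)
    sleep(1)        # stand-in for the real per-iteration work
    rand(9)         # one length-M column of the result (M = 9 here)
end

function foo(N)
    columns = pmap(expensive_iteration, 1:N)   # free workers pull the next i
    hcat(columns...)                           # assemble the M x N matrix
end

foo(60)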
Sebastian Nowozin 8/19/15

Hi Julio,

I believe this is a very common type of workload, especially in scientific computing. In C++ one can use OpenMP for this type of computation, and in Matlab there is parfor. From the user's perspective both "just work". In Julia, I have not found an easy and convenient way to do such a computation.

The difficulties I have experienced, trying to do this with distributed arrays and the Julia parallel operations:

1. Having to prepend @parallel before every import/require so that all parallel processes have all definitions.
2. Working with the distributed arrays API has given me plenty of headaches; it is more like programming with local/global contexts in OpenCL.
3. (I believe this is fixed now.) There were garbage collection issues and crashes on Windows when using distributed arrays.

What would be very convenient is a type of OpenMP-like parallelism, really anything that enables us to write simply

function compute(X::Vector{Float64}, theta)
    n = length(X)
    A = zeros(Float64, n)
    @smp for i=1:n
        A[i] = embarassing_parallel(X[i], theta)
    end
    A
end

where @smp would correspond to "#pragma omp parallel for". I know this may be difficult to implement for a language as dynamic as Julia, but it is hard to argue against this simplicity from the user's point of view. As Clang/LLVM now support OpenMP (https://clang-omp.github.io/), perhaps one can recycle the same OpenMP runtime for such lightweight parallelism?

Thanks,
Sebastian

Júlio Hoffimann 8/19/15

Hi Sebastian,

Thanks for sharing your experience in parallelizing Julia code. I used OpenMP in the past too; it was very convenient in my C++ codebase. I remember an initiative called OpenACC that was trying to bring OpenMP and GPU accelerators together, but I don't know its current status. It may be of interest to the Julia devs.

-Júlio

Nils Gudat 8/20/15

Sebastian, I'm not sure I understand you correctly, but point (1) in your list can usually be taken care of by wrapping all the necessary usings/requires/includes and definitions in an @everywhere begin ... end block.

Julio, as for your original problem, I think Tim's advice about SharedArrays was perfectly reasonable. Without having looked at your problem in detail, I think you should be able to do something like this (and I also think this gets close enough to what Sebastian was talking about, and to Matlab's parfor, unless I'm completely misunderstanding your problem):

nprocs() == CPU_CORES || addprocs(CPU_CORES - 1)

results = SharedArray(Float64, (m, n))

@sync @parallel for i = 1:n
    results[:, i] = complicatedfunction(inputs[i])
end

Ryan Cox 8/20/15

Sebastian,

This talk from JuliaCon 2015 discusses progress on OpenMP-like threading:

Kiran Pamnany and Ranjan Anantharaman, "Multi-threading Julia": http://youtu.be/GvLhseZ4D8M?a

Ryan

Sebastian Nowozin 8/20/15

Thanks Ryan for the pointer, this is awesome work. I am looking forward to this becoming part of the Julia release in Q3.

Sebastian
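The threading work presented in that talk later shipped as the experimental Threads.@threads macro (Julia 0.5 and later, with the thread count set through the JULIA_NUM_THREADS environment variable). A rough sketch of the @smp-style loop Sebastian asks for, written with that macro; sin(theta * X[i]) is only a stand-in for the real per-element work.

function compute(X::Vector{Float64}, theta)
    n = length(X)
    A = zeros(Float64, n)
    # iterations must be independent; each thread writes to disjoint slots of A
    Threads.@threads for i = 1:n
        A[i] = sin(theta * X[i])   # stand-in for embarassing_parallel(X[i], theta)
    end
    return A
end

compute(rand(10^6), 0.5)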
Christopher Alexander Jan 30

I have tried the construction below with no success. In v0.4.3, I end up getting a segmentation fault. In the latest v0.5.0, the run time is 3-4x as long as the non-parallelized version, and the array constructed is vastly different from the one constructed by the non-parallelized code.

Below is the C++ code that I am essentially trying to emulate:

void TreeLattice::stepback(Size i, const Array& values, Array& newValues) const {
    #pragma omp parallel for
    for (Size j=0; j<this->impl().size(i); j++) {
        Real value = 0.0;
        for (Size l=0; l<n_; l++) {
            value += this->impl().probability(i,j,l) *
                     values[this->impl().descendant(i,j,l)];
        }
        value *= this->impl().discount(i,j);
        newValues[j] = value;
    }
}

The calls to probability, descendant, and discount all end up accessing data in other objects, so I tried to prepend those function and type definitions with @everywhere. However, that started me on a long chain of having to eventually wrap each file in my module in @everywhere, and there were still errors complaining about things not being defined.

At this point I am really confused as to how to construct what would appear to be a rather simple parallelized for loop that generates the same results as the non-parallelized code. I've pored over both this forum and other resources, and nothing has really worked. Any help would be appreciated.

Thanks!

Chris

Christopher Alexander Jan 30

By "construction below", I mean this:

results = SharedArray(Float64, (m, n))

@sync @parallel for i = 1:n
    results[:, i] = complicatedfunction(inputs[i])
end
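For reference, a rough Julia sketch (0.4-era syntax) of the stepback loop above, using the SharedArray pattern from earlier in the thread. The probability, descendant, discount, and nbranches definitions are hypothetical stand-ins for the real lattice methods; in the actual module they would have to be made available on every worker, for example with an @everywhere using statement for the module, which is the part Christopher reports trouble with.

nprocs() < CPU_CORES && addprocs(CPU_CORES - nprocs())   # local shared-memory workers

# hypothetical stand-ins for this->impl().probability / descendant / discount
@everywhere probability(i, j, l) = 1/3
@everywhere descendant(i, j, l)  = j + l
@everywhere discount(i, j)       = 0.99
@everywhere nbranches = 3

function stepback(i, values::Vector{Float64}, m::Int)
    newvalues = SharedArray(Float64, m)
    # each slot j is written by exactly one worker, so no locking is needed
    @sync @parallel for j = 1:m
        value = 0.0
        for l = 1:nbranches
            value += probability(i, j, l) * values[descendant(i, j, l)]
        end
        newvalues[j] = value * discount(i, j)
    end
    sdata(newvalues)   # copy back to an ordinary Array
end

stepback(1, rand(20), 10)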