julia-users › Implementing mapreduce parallel model (not general multi-threading): easy and enough?
12 posts by 7 authors

cheng wang 10/5/15

Hello everyone,

I am a Julia newbie, and I have been thrilled by Julia recently. It's an amazing language! I notice that Julia currently does not have good support for multi-threaded programming, so I am thinking that a Spark-like mapreduce parallel model plus multiple processes may be enough. It is easy to keep thread-safe, and it could cover most vector-based computation.

This idea might be too naive. However, I am happy to hear your opinions.

Thanks in advance,
Cheng

Andrei Zh 10/6/15

Julia supports multiprocessing pretty well, including map-reduce-like jobs. E.g. in the next example I add 3 processes to a "workgroup", distribute a simulation between them, and then reduce the results via the (+) operator:

    julia> addprocs(3)
    3-element Array{Int64,1}:
     2
     3
     4

    julia> nheads = @parallel (+) for i = 1:200000000
               Int(rand(Bool))
           end
    100008845

You can find the full example and a lot of other fun in the official documentation on parallel computing: http://julia.readthedocs.org/en/latest/manual/parallel-computing/

Note, though, that it's not real (i.e. Hadoop/Spark-like) map-reduce, since the original idea of MR concerns distributed systems and data-local computation, while here we do everything on the same machine. If you are looking for a big-data solution, search this forum for some (dead or alive) projects for it.

Stefan Karpinski 10/6/15

That works fine in a distributed setting if you start Julia workers on other machines, so it is actually a legitimate form of map-reduce. It doesn't do anything to handle machine failures, however, which was arguably the major concern of the original MapReduce design.

Andrei Zh 10/6/15

Yet calling Julia processes on other machines via ssh doesn't address data locality. In big data systems (say, > 1 TB) the main performance concern is not the number of CPUs but IO operations and data movement across the cluster, so map-reduce tries to do as much as possible on local data without any movement (the map phase) and then combines results globally (the reduce phase). This way a little program is sent to the data nodes instead of huge data being sent to the program's node(s).

As far as I know, Julia doesn't provide any tools for working with huge distributed datasets; that's why I say it doesn't give you Hadoop- (or Spark-, or Google-like) map-reduce. But it's quite easy to add these features of MR too. E.g. one can use Elly.jl to access HDFS (including the location of data blocks) and execute tasks using remotecall() on the Julia worker that is closest to the data.

Tim Holy 10/6/15

There's

    https://github.com/JuliaParallel/DistributedArrays.jl
    https://github.com/JuliaParallel/HDFS.jl

in case they help. (See the other packages in JuliaParallel, in case you have missed that organization.)

--Tim
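To illustrate Tim's pointer: a minimal DistributedArrays.jl sketch of the single-cluster map/reduce pattern discussed above, assuming the 0.4-era API (drand scatters chunks across the workers; the usual map and reduction functions then operate chunk-locally):

    addprocs(3)                      # local here, but addprocs accepts remote machine specs too
    @everywhere using DistributedArrays

    d = drand(10_000_000)            # chunks of the array live on the worker processes
    s = sum(d)                       # each worker reduces its own chunk; partial sums are combined
    m = map(x -> 2x, d)              # elementwise map; the result stays distributed
    l = localpart(d)                 # the chunk stored on the calling process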
David van Leeuwen 10/6/15

See also an earlier discussion on a similar topic, for an out-of-core approach.

---david

Stefan Karpinski 10/6/15

In my experience, Hadoop is pretty terrible about minimizing data movement; Spark seems to be significantly better. The only codes that really nail it are carefully handcrafted HPC codes.

Andrei Zh 10/6/15

> In my experience, Hadoop is pretty terrible about minimizing data movement; Spark seems to be significantly better.

If you mean MapReduce (the framework, version 1 or 2), it doesn't move data anywhere unless you tell it to do so in the reduce phase. You can hit another issue with MR1, though: multistage jobs read from and write to disk at every stage, which makes them terribly slow. (Recall that Hadoop was born to efficiently and reliably download and store millions of web pages obtained with Nutch, not to run iterative machine-learning algorithms.)

> The only codes that really nail it are carefully handcrafted HPC codes.

Could you please elaborate on this? I think I know the Spark codebase quite well, but I can't connect it to the notion of handcrafted HPC code.

Steven Sagaert 10/7/15

> The only codes that really nail it are carefully handcrafted HPC codes.

I think what is meant is that in HPC this is typically done via MPI, which is a low-level approach where you have to specify all the data communication explicitly (compared to Hadoop & Spark, where it is implicit).
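To make Andrei's Elly.jl + remotecall() suggestion concrete, here is a rough sketch of the kind of explicit, data-local dispatch Steven describes, using Julia 0.4's remotecall argument order. The block_host and worker_on helpers are hypothetical stand-ins for an HDFS client's block-location lookup and your own host-to-worker bookkeeping; they are not Elly.jl's actual API:

    # Hypothetical helpers (illustrative only, not a real API):
    #   block_host(path, i) -> hostname that stores block i of the file
    #   worker_on(host)     -> pid of a Julia worker started on that host

    @everywhere function process_block(path, i)
        # read block i from the local disk and summarize it (placeholder logic)
        return 1
    end

    path    = "/data/pages"    # example values
    nblocks = 8

    # "map" phase: run each task on the worker that holds the block locally
    refs = [remotecall(worker_on(block_host(path, i)), process_block, path, i)
            for i in 1:nblocks]

    # "reduce" phase: combine the partial results on the master
    total = reduce(+, map(fetch, refs))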
cheng wang 10/7/15

Thanks all for replying. I had read the parallel computing documentation before I posted this. Actually, what I mean is a shared-memory model, not a distributed model. My daily research involves extensive use of BLAS and parallel for-loops. Julia has perfect support for BLAS, and parallel for-loops can be handled with multiple processes. However, if I want a shared array that supports efficient BLAS and parallel for-loops at the same time, what is the best solution?

Jonathan Malmaud 10/7/15

Within the next few days, support for native threads will be merged into the development version of Julia (https://github.com/JuliaLang/julia/pull/13410). You can also use the SharedArray type, which Julia already has; it lets multiple Julia processes running on the same machine share memory. You would use the standard Julia task-parallel tools (like @parallel for, etc.) in that model.

cheng wang 10/7/15

Thx a lot. You saved my life :)
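For reference, a minimal sketch of the SharedArray pattern Jonathan describes, in the Julia 0.4-era API: several local processes fill one shared buffer with @parallel for, and the same buffer is then handed to a BLAS routine without copying, which is the combination Cheng asked about:

    addprocs(3)

    S = SharedArray(Float64, 10_000_000)   # shared memory, visible to all local workers

    @sync @parallel for i in eachindex(S)  # each worker writes its own range of indices
        S[i] = sin(i)
    end

    d = dot(sdata(S), sdata(S))            # sdata returns the backing Array, so dot hits BLAS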