julia-users › Implementing mapreduce parallel model (not general multi-threading): easy and enough?
12 posts by 7 authors

cheng wang  10/5/15
Hello everyone,

I am a Julia newbie, and I have been thrilled by Julia recently. It's an amazing language! I notice that Julia currently does not have good support for multi-threaded programming, so I am thinking that a Spark-like mapreduce parallel model plus multi-processing may be enough. It is easy to make thread-safe, and it could handle most vector-based computation. This idea might be too naive; still, I am happy to hear your opinions.

Thanks in advance,
Cheng

Andrei Zh  10/6/15
Julia supports multiprocessing pretty well, including map-reduce-like jobs. E.g. in the next example I add 3 processes to a workgroup, distribute a simulation between them, and then reduce the results via the + operator:

julia> addprocs(3)
3-element Array{Int64,1}:
 2
 3
 4

julia> nheads = @parallel (+) for i = 1:200000000
           Int(rand(Bool))
       end
100008845

You can find the full example and a lot of other fun in the official documentation on parallel computing: http://julia.readthedocs.org/en/latest/manual/parallel-computing/

Note, though, that it's not real (i.e. Hadoop/Spark-like) map-reduce, since the original idea of MR concerns distributed systems and data-local computation, while here we do everything on the same machine. If you are looking for a big data solution, search this forum for some dead or alive projects for it.

Stefan Karpinski  10/6/15
That works fine in a distributed setting if you start Julia workers on other machines, so it is actually a legitimate form of map reduce. It doesn't do anything for handling machine failures, however, which was arguably the major concern of the original MapReduce design.

Andrei Zh  10/6/15
Yet, calling Julia processes on other machines via ssh doesn't address data locality. In big data systems (say, 1TB), the main performance concern is not the number of CPUs, but IO operations and data movement across a cluster, so map reduce tries to do as much as possible on local data without any movement (map phase) and then combine results globally (reduce phase). This way a little program is sent to the data nodes instead of huge data being sent to the program's node(s).

As far as I know, Julia doesn't provide any tools for working with huge distributed datasets; that's why I say it doesn't give you Hadoop-, Spark-, or Google-like map-reduce. But it's quite easy to add these features of MR too. E.g. one can use Elly.jl to access HDFS (including the location of data blocks) and execute tasks using remotecall on the Julia worker which is closest to the data.

Tim Holy  10/6/15
There's
https://github.com/JuliaParallel/DistributedArrays.jl
https://github.com/JuliaParallel/HDFS.jl
in case they help. See the other packages in JuliaParallel, in case you have missed that organization.

--Tim

David van Leeuwen  10/6/15
See also an earlier discussion on a similar topic, for an out-of-core approach.

---david

Stefan Karpinski  10/6/15
In my experience, Hadoop is pretty terrible about minimizing data movement; Spark seems to be significantly better. The only codes that really nail it are carefully handcrafted HPC codes.

Andrei Zh  10/6/15
> In my experience, Hadoop is pretty terrible about minimizing data movement; Spark seems to be significantly better.

If you mean MapReduce (the framework, version 1 or 2), it doesn't move data anywhere unless you tell it to do so in the reduce phase. You could experience another issue with MR1 - multiple reads and writes to disk on multistage jobs, which makes them terrrrribly slow. Recall that Hadoop was born to efficiently and reliably download and store millions of web pages obtained using Nutch, not to write iterative machine learning algorithms.

> The only codes that really nail it are carefully handcrafted HPC codes.

Could you please elaborate on this? I think I know Spark code quite well, but can't connect it to the notion of handcrafted HPC code.
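[Editor's note] The data-local scheme Andrei describes above (use Elly.jl to learn where HDFS blocks live, then remotecall onto the worker nearest each block) might be sketched as below. This is only an illustration, not Elly.jl's actual API: the block-to-host map, host names, and `count_lines` helper are all made up, and it uses the Julia 0.4-era argument order `remotecall(worker, f, args...)` to match the thread.

```julia
# Sketch: data-local task placement, assuming one Julia worker per data node.
addprocs(["node1", "node2"])    # hypothetical hosts, started via ssh

# Hypothetical block -> host map; in reality this would come from an
# HDFS metadata query (e.g. via Elly.jl).
block_hosts = Dict("/data/part-0" => "node1",
                   "/data/part-1" => "node2")

# Discover which worker runs on which host, so we can target the one
# sitting next to each block.
host_worker = Dict{AbstractString,Int}()
for w in workers()
    host_worker[remotecall_fetch(w, gethostname)] = w
end

# Stand-in for real per-block processing.
count_lines(path) = countlines(path)

# "Map": launch each task on the worker local to its block.
futures = [remotecall(host_worker[h], count_lines, blk)
           for (blk, h) in block_hosts]

# "Reduce": combine the partial results on the master.
total = sum(map(fetch, futures))
```

Only the small closures travel over the wire; the data stays put, which is exactly the locality property Andrei is pointing at.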
Steven Sagaert  10/7/15
> Could you please elaborate on this? I think I know Spark code quite well, but can't connect it to the notion of handcrafted HPC code.

I think what is meant is that in HPC this is typically done via MPI, which is a low-level approach where you explicitly have to specify all the data communication (compared to Hadoop & Spark, where it is implicit).

cheng wang  10/7/15
Thanks all for replying. I had read the parallel computing documentation before I posted this. Actually, what I mean is a shared-memory model, not a distributed model. My daily research involves extensive use of BLAS and parallel for-loops. Julia has perfect support for BLAS, and parallel for-loops can be handled by multi-processing. However, if I want a shared array that can do efficient BLAS and parallel for-loops at the same time, what is the best solution?

Jonathan Malmaud  10/7/15
Within the next few days, support for native threads will be merged into the development version of Julia (https://github.com/JuliaLang/julia/pull/13410). You can also use the SharedArray type which Julia already has, which lets multiple Julia processes running on the same machine share memory. You would use the standard Julia task-parallel tools like @parallel for, etc. in that model.

cheng wang  10/7/15
Thx a lot. You saved my life :
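[Editor's note] Jonathan's SharedArray suggestion could look roughly like this for cheng's use case: one array whose memory is visible both to BLAS (through the master process) and to a @parallel loop across local workers. A minimal sketch in Julia 0.4-era syntax, matching the thread; sizes and the doubling update are arbitrary.

```julia
addprocs(2)                        # local workers that share memory with us

A = SharedArray(Float64, (1000, 1000))
A[:] = rand(1000, 1000)            # fill from the master process
x = rand(1000)

# BLAS gemv on the shared memory: sdata gives the backing dense Array.
y = sdata(A) * x

# Parallel element-wise update of the very same memory, split across
# processes.  @sync is needed: a bare @parallel loop without a reducer
# returns before the work finishes.
@sync @parallel for j in 1:size(A, 2)
    @inbounds for i in 1:size(A, 1)
        A[i, j] *= 2
    end
end
```

After the loop, `sdata(A) * x` yields `2y`, confirming that BLAS and the parallel loop really did operate on the same storage.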