julia-users › I can't believe this speed-up! (15 posts by 6 authors)

Ferran Mazzanti (Jul 21):

Hi, mostly showing my astonishment, but I can't even understand the figures in this stupid parallelization code:

```julia
A = [1.0 1.0001; 1.0002 1.0003]
z = A
tic()
for i in 1:1000000000
    z *= A
end
toc()
A
```

produces

```
elapsed time: 105.458639263 seconds
2x2 Array{Float64,2}:
 1.0     1.0001
 1.0002  1.0003
```

But then add @parallel to the for loop:

```julia
A = [1.0 1.0001; 1.0002 1.0003]
z = A
tic()
@parallel for i in 1:1000000000
    z *= A
end
toc()
A
```

and get

```
elapsed time: 0.008912282 seconds
2x2 Array{Float64,2}:
 1.0     1.0001
 1.0002  1.0003
```

Look at the elapsed time difference! And I'm running this on my Xeon desktop, not even a cluster. Of course A - B reports

```
2x2 Array{Float64,2}:
 0.0  0.0
 0.0  0.0
```

So is this what one should expect from this kind of simple parallelization? If so, I'm definitely in love with Julia :):):)

Best,
Ferran.

Chris Rackauckas (Jul 21):

I wouldn't expect that much of a change unless you have a whole lot of cores (and even then, I wouldn't expect this much of a change). Is this wrapped in a function when you're timing it?

Nathan Smith (Jul 21):

Hey Ferran,

You should be suspicious when your apparent speed-up surpasses the level of parallelism available on your CPU. It looks like your codes don't actually compute the same thing. I'm assuming you're trying to compute the matrix power A^1000000000 by repeatedly multiplying by A. In your parallel code, each process gets a local copy of 'z' and uses that. This means each process is computing something like A^(1000000000/nprocs). Check out the section of the documentation on parallel map and loops to see what I mean.

That said, that doesn't explain your speed-up completely; you should also make sure that each part of your script is wrapped in a function and that you 'warm up' each function by running it once before comparing.
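The serial loop Nathan describes is just matrix exponentiation by repeated multiplication. A small, version-agnostic sketch (the function name `matpow_by_loop` is hypothetical, not from the thread), which also follows the "wrap it in a function" advice:

```julia
# Hypothetical helper mirroring the serial loop: z starts at A, so
# n-1 further multiplications by A give A^n.
function matpow_by_loop(A, n)
    z = A
    for i in 1:n-1
        z = z * A
    end
    return z
end

A = [1.0 1.0001; 1.0002 1.0003]
matpow_by_loop(A, 5) ≈ A^5   # true: matches the built-in matrix power
```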
Cheers,
Nathan

Nathan Smith (Jul 21):

Try comparing these two functions:

```julia
function serial_example()
    A = [1.0 1.001; 1.002 1.003]
    z = A
    for i in 1:1000000000
        z *= A
    end
    return z
end

function parallel_example()
    A = [1.0 1.001; 1.002 1.003]
    z = @parallel (*) for i in 1:1000000000
        A
    end
    return z
end
```

Ferran Mazzanti (Jul 21):

I posted this because I also find the results... astonishingly surprising. However, the timings are apparently real, as the first one took more than 1.5 min on my wrist watch, and the second calculation was instant. And no, no function wrapping whatsoever...

Ferran Mazzanti (Jul 21):

Hi Nathan,

I posted the codes, so you can check whether they do the same thing or not. These went into separate cells in Jupyter, nothing more and nothing less. Not even a single line I didn't post. And yes, I understand your line of reasoning, so that's why I was astonished too. But I can't see what is making this huge difference, and I'd like to know :)

Best,
Ferran.

Ferran Mazzanti (Jul 21):

Nathan, the execution of these two functions gives essentially the same timings, no matter how many processes I have added with addprocs(). Very surprising to me... Of course I prefer the sped-up version :)

Best,
Ferran.

Nathan Smith (Jul 21):

To be clear, you need to compare the final 'z', not the final 'A', to check whether your calculations are consistent. The matrix A does not change throughout this calculation, but the matrix z does. Also, there is no parallelism in the @parallel loop unless you start Julia with 'julia -p N', where N is the number of processes you'd like to use.

Chris Rackauckas (Jul 21):

Always wrap it in a function. But the real issue is that they don't evaluate to the same thing. I'd write it as:

```julia
const N = 100000

function test1()
    A = [1.0 1.0001; 1.0002 1.0003]
    z = A
    for i in 1:N
        z *= A
    end
    z
end

function test2()
    A = [1.0 1.0001; 1.0002 1.0003]
    z = A
    @parallel for i in 1:N
        z *= A
    end
    z
end

test1() == test2() # Test that the outputs are the same
@time test1()
@time test2()
```

Notice the test is false.
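A quick sanity check on the magnitudes involved (an editorial sketch, not from the thread): the dominant eigenvalue of A is just above 2, so repeated multiplication overflows Float64 long before 100000 iterations.

```julia
A = [1.0 1.0001; 1.0002 1.0003]
# trace(A) ≈ 2.0003 and det(A) ≈ -2e-8, so the dominant eigenvalue is
# ≈ 2.0003; entries of A^n grow roughly like 2^n, and Float64 overflows
# past ~1.8e308, i.e. after roughly a thousand doublings.
maximum(A^100)          # already astronomically large (> 1e25)
isinf(maximum(A^2000))  # true: overflowed well before n = 100000
```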
test1() gives a 2x2 matrix of Infs, while test2() returns the same matrix as A. Adding @parallel changes the computation because it's using a local variable, as Nathan stated.

Nathan Smith (Jul 21):

One typo in my functions. The serial version should be:

```julia
function serial_example()
    A = [1.0 1.001; 1.002 1.003]
    z = eye(2)
    for i in 1:1000000000
        z *= A
    end
    return z
end
```

to be consistent. With 4 processors, I see roughly a 2x speed-up for the parallel version, and the calculations are consistent.

(In a Jupyter notebook, add processors with addprocs(N).)

Kristoffer Carlsson (Jul 21):

```julia
julia> @time for i in 1:10 sleep(1) end
 10.054067 seconds (60 allocations: 3.594 KB)

julia> @time @parallel for i in 1:10 sleep(1) end
  0.195556 seconds (28.91 k allocations: 1.302 MB)
1-element Array{Future,1}:
 Future(1,1,8,#NULL)
```

Greg Plowman (Jul 21):

and also compare (note the @sync time):

```julia
@sync @parallel for i in 1:10
    sleep(1)
end
```

Also note that using a reduction with @parallel will also wait:

```julia
z = @parallel (*) for i = 1:n
    A
end
```

Roger Whitney (Jul 22):

Instead of using tic()/toc(), use @time to time your loops. You will find that in your sequential loop you are allocating a lot of memory, while the parallel loop does not. The difference in time is due to the memory allocation. One of my students ran into this earlier this week, and that was the cause in his case. My understanding is that the compiler does not optimize for loops done at the top level.
When you put the sequential loop in a function, the excessive memory allocation goes away, which makes the sequential loop faster.

You need to be careful using @parallel with no worker processes. With no workers, the @parallel loop can modify globals, and you will get the correct result because it is all done in the same process. When you add workers, the globals will be copied to each worker, the changes will be made to the workers' copies, and the results are not copied back to the master process. So code that works with no workers will break when you add workers.

Ferran Mazzanti (Jul 23):

Hi Roger,

that makes a lot of sense to me... I'll also be careful with globals. Still, if the mechanism is the one you mention, there is something fuzzy here, as the timings I posted are real, human-wise, in the sense that the reported times were the ones I actually had to wait in front of my computer to get the result. Shall I understand, then, that top-level loops are highly unoptimized?

Best,
Ferran.
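Roger's worker-copy caveat can be demonstrated directly. The sketch below uses the later Distributed stdlib spelling (an assumption: in Julia 1.0+ the thread's @parallel became @distributed, and workers are added with addprocs):

```julia
using Distributed
addprocs(2)                     # with no workers, the problem is masked

a = zeros(5)
@sync @distributed for i in 1:5
    a[i] = i                    # mutates each worker's *copy* of a
end
println(a)                      # still all zeros on the master process

# A reduction ships partial results back to the master explicitly:
s = @distributed (+) for i in 1:5
    i
end
println(s)                      # 15
```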
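For globals, the answer to Ferran's last question is essentially yes: a top-level loop over an untyped global cannot be specialized by the compiler, so it allocates on every iteration, while the identical loop inside a function does not. A minimal sketch (the function name is hypothetical; exact times and allocation counts vary by machine and Julia version):

```julia
s = 0.0
@time for i in 1:10^6
    global s += i * i           # untyped global: allocates every iteration
end

function sum_squares(n)
    s = 0.0                     # local with a known concrete type
    for i in 1:n
        s += i * i
    end
    return s
end

sum_squares(10)                 # warm-up call so compilation isn't timed
@time sum_squares(10^6)         # compiled loop: fast, essentially no allocation
```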