Problems with scaling pmap beyond 6-7 cores #11354 (Closed)
jasonmorton opened this issue on May 19, 2015 · 4 comments
Labels: parallel, performance

jasonmorton commented on May 19, 2015

I'm working with a 16-core, 32-thread machine with 32 GB RAM that presents to Ubuntu as 32 cores, and I'm trying to understand how to get the best performance for embarrassingly parallel tasks, using a batch of SVDs computed in parallel as an example. The scaling seems to be perfect (6.6 seconds regardless of the number of SVDs) until about 7 or 8 simultaneous SVDs, at which point the time starts to creep up, scaling roughly linearly although with high variance, up to 22 seconds for 16 and 47 seconds for 31. Watching htop, I can confirm that the number of processors in use equals the number being pmapped over, so I don't think OpenBLAS multithreading is the issue. Memory usage stays low. Any guess as to what is going on? I'm using the generic Linux binary julia-79599ada44. I don't think there should be any sending of the matrices, but perhaps that is the issue. Probably I am missing something obvious.

With nprocs() == 16:

```julia
julia> @time pmap(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16]);
elapsed time: 22.350466328 seconds (12292776 bytes allocated)

julia> @time map(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16]);
elapsed time: 91.135322511 seconds (10269056672 bytes allocated, 2.57% gc time)
```

With nprocs() == 31:

```julia
# perfect scaling until here, at 6x speedup
julia> @time pmap(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:6]);
elapsed time: 6.720786336 seconds (159168 bytes allocated)

julia> @time map(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:6]);
elapsed time: 34.146665292 seconds (3847940044 bytes allocated, 2.46% gc time)

# 4.5x speedup
julia> @time pmap(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16]);
elapsed time: 19.819358972 seconds (391056 bytes allocated)

julia> @time map(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16]);
elapsed time: 90.688842475 seconds (10260844684 bytes allocated, 2.36% gc time)

# 3.69x speedup
julia> @time pmap(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:nprocs()]);
elapsed time: 47.411315342 seconds (738616 bytes allocated)

julia> @time map(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:nprocs()]);
elapsed time: 175.308752879 seconds (19880206220 bytes allocated, 2.34% gc time)
```

tkelman added the parallel label on May 19, 2015

simonster (The Julia Language member) commented on May 19, 2015

Same as #10427?

simonster added the performance label on May 19, 2015

ViralBShah (The Julia Language member) commented on May 20, 2015

amitmurthy showed me that even if you just run multiple individual julias from bash, not communicating with each other, you get the same slowdown. Perhaps this does not have to do with Julia.

amitmurthy (The Julia Language member) commented on May 20, 2015

Try

```bash
for i in `seq 1 n`
do
    julia -e 'blas_set_num_threads(1); svd(rand(100,100)); sleep(1.0); @time [svd(rand(1000,1000))[2][1] for i in 1:10]' &
done
```

replacing n in `seq 1 n` with values 1, 2, 4, 8, etc. On my 4-core, 8-thread laptop, I get 5.9, 7.4, 11.5, and 32.7 seconds for n of 1, 2, 4, and 8 respectively. Ideally I would expect around the same 5.9 seconds for 1, 2, and 4 parallel runs, since there are 4 actual cores, i.e. ignoring hyperthreading. I suspect L1/L2 cache contention as the cause of the slowdown.

Note that these are all independent Julia processes running concurrently; the Julia parallel infrastructure is not used here.
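For comparison, a minimal sketch of the same single-BLAS-thread setup applied back to the pmap benchmark; the worker count and the use of `@everywhere` are assumptions (0.3-era syntax, where `blas_set_num_threads` is exported), not code from the thread:

```julia
# Sketch only: pin OpenBLAS to one thread on every worker before running the
# pmap benchmark, to rule out BLAS oversubscription as a cause.
# Assumes a 0.3-era Julia and a hypothetical 16 workers.
addprocs(16)                          # one worker per physical core
@everywhere blas_set_num_threads(1)   # single BLAS thread per process
@time pmap(x -> [svd(rand(1000,1000))[2][1] for i in 1:10], [i for i in 1:16])
```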
jasonmorton commented on May 20, 2015

Yes, this is it. I get the same slowdown with bash, so I think you are right and it is just cache contention. If I change the benchmark to computing 1000 100x100 SVDs, the speedup is much better, around 16.5x at 31 threads. I need to think much more carefully about cache in my application. Thanks.

jasonmorton closed this on May 20, 2015
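A minimal sketch of the small-matrix variant described in the closing comment; the exact code is not shown in the thread, so the matrix count per task and the worker count below are assumptions:

```julia
# Sketch only: many small SVDs (100x100, ~80 KB each) instead of a few large
# ones (1000x1000, ~8 MB each), so each task's working set fits in per-core
# cache. Reported above to reach roughly 16.5x speedup with 31 workers.
@time pmap(x -> [svd(rand(100,100))[2][1] for i in 1:1000], [i for i in 1:31])
@time  map(x -> [svd(rand(100,100))[2][1] for i in 1:1000], [i for i in 1:31])
```

The much smaller per-task working set is consistent with the L1/L2 cache-contention explanation above.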