julia-users › confusion with julia -p 2
8 posts by 3 authors

Dan Y, Apr 22
I use Julia 0.4.5 on Windows 10 with a Pentium B970 (2 cores, 2 threads). I have code with no parallel instructions. When I run it in a standard session with no options, my CPU utilization is about 80-90%. But when I run it in a session started with "julia -p 2", CPU utilization is only about 20-25%, and the code actually runs about 3x slower. I wonder why. I haven't tried to parallelize the code yet, and I doubt it would be worth it.

Stefan Karpinski, Apr 22
It's hard to provide useful feedback without more details.

Dan Y, Apr 23
Maybe it is because I use matrix multiplication in my code. I guess the base library responsible for linear algebra uses multiple cores automatically, but with "-p 2" it is forced to work on only one core for some reason. It's only a guess, though.

Jason Eckstein, Apr 23
Yes, that is correct. If you want to force each worker to use every core for linear algebra operations, you need the following line:

    blas_set_num_threads(CPU_CORES)

I've done this myself and had parallel Julia processes each using more than one core. Make sure that line of code is run in each parallel instance being called. (A sketch of the full setup appears after the thread.)

Dan Y, Apr 24
Thanks! With blas_set_num_threads(CPU_CORES) the running times are now equal.

Dan Y, Apr 24
I must add that making the code parallel was worth it! In my particular case I got the following numbers:

21 s - parallel code; blas_set_num_threads doesn't affect this much
32 s - single-threaded code with blas_set_num_threads(2)
94 s - single-threaded code with blas_set_num_threads(1)

The parallel code also reported far fewer allocations, 17 MB vs 8.6 GB (which looks strange to me; I guess allocations within the workers weren't counted). So while BLAS can use multiple cores to achieve decent results, hand-written parallel code is still considerably better, and it is not so hard to write in Julia.

Jason Eckstein, Apr 24
Without seeing exactly what you're doing it's hard to say what the optimal setup is. But if you have enough small matrix operations that a single thread would not drive the CPU to 100%, then running many parallel processes can be faster even with BLAS limited to a single thread, because together the workers use up the entire CPU.

Dan Y, Apr 24
The funniest part: I managed to reduce the running time to 3 seconds. My code computes dot(A*B*f, f) many times, for matrices A, B and vectors f filled with Complex{Float64}. But this is the same as dot(B*f, A'*f)! So the multiplication of the matrices was the most time-consuming part, even though the matrices were sparse, with sizes around 50x50.
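
A minimal sketch of Jason's fix, in the Julia 0.4 syntax used in this thread (in Julia 0.7 and later the call is LinearAlgebra.BLAS.set_num_threads, and CPU_CORES became Sys.CPU_THREADS). Here addprocs(2) stands in for starting the session with "julia -p 2":

    # Start two worker processes, equivalent to launching with `julia -p 2`.
    addprocs(2)

    # Worker processes default to single-threaded BLAS to avoid
    # oversubscription, which is why plain matrix code slows down under -p 2.
    # @everywhere runs the call on the master and on every worker, re-enabling
    # all cores for linear algebra in each process:
    @everywhere blas_set_num_threads(CPU_CORES)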
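
The 3-second result in the last post follows from a linear-algebra identity: in Julia, dot conjugates its first argument and A' is the conjugate transpose, so both dot(A*B*f, f) and dot(B*f, A'*f) equal f' * B' * A' * f, but the second form replaces the matrix-matrix product A*B with two matrix-vector products. A self-contained check of the identity, with made-up sizes and densities (again Julia 0.4 syntax, matching the thread):

    srand(1)
    n = 50
    # Sparse Complex{Float64} matrices, roughly like those described above.
    A = sprandn(n, n, 0.1) + im*sprandn(n, n, 0.1)
    B = sprandn(n, n, 0.1) + im*sprandn(n, n, 0.1)
    f = complex(randn(n), randn(n))

    slow = dot(A*B*f, f)    # parses as (A*B)*f: forms the matrix product first
    fast = dot(B*f, A'*f)   # two sparse matrix-vector products instead
    @assert isapprox(slow, fast)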