julia-users › multiple processes question

noel ryan, Oct 5

I am an undergraduate working on a Julia parallelism project. I have read in quite a few tutorials that, to get the best parallel performance, I should spawn a number of processes equal to the number of cores in my processor (I am working with 2 cores & 4 threads). However, in a test to check processing speeds (a Monte Carlo test for pi to 1 billion), my result was that using 17 processes calculated the quickest. Adding extra processes above 17 didn't speed up the calculation. Can anyone explain what is happening here? Any help would be great.

Regards,
Noel

Chris Rackauckas, Oct 5

See this blog post. If your code is perfectly efficient, then yes, use processes equal to the number of cores (so for something like BLAS, where it's written as the most efficient threaded algorithms you could imagine). But for your simple homework assignment? There will be time lost due to inefficiencies. It ends up being much faster to overload the scheduler so that, while one process is being slow due to moving data or something like that, it will kick another one in, so that something is always computing on each core. Even though this will cause some cache misses, if your program is not perfectly efficient, this will win the tradeoff.

So while in theory cores = threads (you just write efficient code, and this choice is best because there are no cache misses)... that's generally not reality. This is the same principle as Amdahl's Law, though that's an iffy explanation, since normally that law is in the context of efficiency measured as what percentage of the program is serial vs parallel. Here the efficiency loss is due to the higher-level programming context not being 100% bare-metal efficient, but it's the same idea.
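For reference, Amdahl's law is usually written as below. The symbols are introduced here and do not appear in the post: p is the fraction of the run time that can be parallelized, n is the number of processes, and S(n) is the resulting speedup.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}
```

With p = 1, as for an idealized Monte Carlo run, this gives S(n) = n. Any overhead that behaves like a serial part makes p < 1 and caps the speedup at 1/(1 - p), however many processes are added.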
Note that your Monte Carlo pi calculation probably is 100% parallel, so it would look like Amdahl's-law-type things don't apply, but that's only when you abstract and ignore all of the details of computing (caching, data movement, etc.). I was taught the same thing, yet if you continuously benchmark, this will be true only for the most performant and optimal threaded/MPI code. It shouldn't be taught anymore: you should just be taught to benchmark your code.

noel ryan, Oct 12

Thanks a lot Chris, your insight has been extremely helpful.
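As an illustration of the "just benchmark it" advice in this thread, here is a minimal sketch of timing a Monte Carlo pi estimate for several worker counts. It is not the code from the original question, which was not posted. The helper names `count_hits` and `estimate_pi`, the upper bound of 8 workers, and the sample count are all assumptions made for this example. It uses the `Distributed` standard library.

```julia
using Distributed

# Assumption: 8 workers is an arbitrary upper bound chosen for this sketch.
addprocs(8)

# Hypothetical helper, defined on every process so workers can run it:
# count random points in the unit square that land inside the quarter circle.
@everywhere function count_hits(n::Int)
    hits = 0
    for _ in 1:n
        hits += (rand()^2 + rand()^2 <= 1.0)
    end
    return hits
end

# Hypothetical helper: estimate pi using only the first `np` workers,
# giving each of them an equal share of roughly `total` samples.
function estimate_pi(total::Int, np::Int)
    pool = WorkerPool(workers()[1:np])
    chunk = total ÷ np
    hits = sum(pmap(count_hits, pool, fill(chunk, np)))
    return 4 * hits / (chunk * np)
end

# Warm up every process so compilation time is not included in the timings.
@everywhere count_hits(1)
estimate_pi(1_000, 1)

total = 10^7  # assumption: far fewer samples than the 1 billion in the question
for np in (1, 2, 4, 8)
    t0 = time()
    est = estimate_pi(total, np)
    elapsed = time() - t0
    println("workers = $np  time = $(round(elapsed, digits=3)) s  estimate = $est")
end
```

The timings depend on the machine, so the fastest worker count has to be read off from the printed results rather than assumed. Call `rmprocs(workers())` afterwards to shut the workers down.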