increasing -p workers slows program down (not so with --machinefile) #12611

Closed · denizyuret opened this issue on Aug 13, 2015 · 7 comments
Labels: needs more info, parallel, performance

denizyuret commented on Aug 13, 2015

In a simple parallel program, more workers started with -p seem to slow things down, whereas workers started with --machinefile seem to speed things up. I confirmed this with both v0.3 and v0.4. Here is a parallel program:

    M = [rand(1000,1000) for i=1:16]
    @time pmap(svd, M)

Here are timing results for local workers on a 16 core machine1:

    julia -p 2: 14.98 secs
    julia -p 4: 16.02 secs
    julia -p 8: 17.64 secs

Here are timing results for machine1 connecting to remote workers on same type of machine2:

    julia --machinefile <2 copies of machine2>: 11.75 secs
    julia --machinefile <4 copies of machine2>: 7.54 secs
    julia --machinefile <8 copies of machine2>: 6.46 secs

At first I thought things got messed up if the master and the slaves were on the same machine. But it turns out the difference is between -p vs. --machinefile. If I rerun the same test on a single machine, but use --machinefile instead of -p n:

    julia --machinefile <2 copies of machine1>: 8.41 secs
    julia --machinefile <4 copies of machine1>: 4.70 secs
    julia --machinefile <8 copies of machine1>: 3.31 secs

jakebolewski added the parallel label on Aug 13, 2015
kshyatt added the performance label on Aug 13, 2015

tkelman (The Julia Language member) commented on Aug 13, 2015

Can you try modifying blas_set_num_threads, or using a workload that doesn't call into openblas? There's a chance machinefile may be setting the number of threads to 1 per worker by default?
yuyichao (The Julia Language member) commented on Aug 13, 2015

Ref julia-user

denizyuret commented on Aug 13, 2015

Setting blas_set_num_threads to 1 did not make a difference (but I was not sure whether this is because it is encountered too late). More importantly, however, using sort! instead of svd still shows the same behavior:

    M = [rand(1000000) for i=1:16]
    @time pmap(sort!, M)

This gives the following timing results (all workers on the local machine):

    n     -p      --machinefile
    1     2.23    3.26
    2     2.45    2.02
    4     2.98    1.44
    8     3.97    1.26
    16    6.08    1.04

amitmurthy (The Julia Language member) commented on Aug 14, 2015

Can you print the contents of your machinefile?

amitmurthy (The Julia Language member) commented on Aug 14, 2015

Anyone else seeing this? On my local machine, the timings are similar.

amitmurthy (The Julia Language member) commented on Aug 14, 2015

-p uses LocalManager, and the Julia worker is started as (output of ps):

    /home/amitm/Work/julia/julia/usr/bin/julia -Cnative -J/home/amitm/Work/julia/julia/usr/lib/julia/sys.so --bind-to 192.168.0.103 --worker

--machinefile uses SSHManager, and the Julia worker is started as:

    /home/amitm/Work/julia/julia/usr/bin/julia --worker

What is the impact of the -Cnative and -J flags on performance?

amitmurthy added the needs-more-info label on Jul 21

amitmurthy (The Julia Language member) commented on Jul 21

Please reopen if you continue seeing this.

amitmurthy closed this on Jul 21
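
For anyone retrying tkelman's suggestion, the concern that the BLAS thread setting is "encountered too late" can be avoided by applying it on every process before the timed call. The following is only a minimal sketch, and it does not claim to explain the timings reported in this thread. It assumes a current Julia 1.x, where `Distributed` and `LinearAlgebra` are standard libraries and the call is `BLAS.set_num_threads`; on the v0.3 and v0.4 versions discussed here the function was `blas_set_num_threads` and the `using` lines were not needed. The worker count of 4 is an arbitrary choice for illustration.

```julia
using Distributed
using LinearAlgebra

# Assumption: start 4 local workers if none exist yet
# (roughly what launching with `julia -p 4` does).
if nprocs() == 1
    addprocs(4)
end

# Load LinearAlgebra on every process, then limit BLAS to one thread
# per process so that workers do not oversubscribe the cores.
@everywhere using LinearAlgebra
@everywhere BLAS.set_num_threads(1)

M = [rand(1000, 1000) for i in 1:16]

# Warm-up call: the first pmap includes compilation time on each worker.
pmap(svd, M)

# Timed call, same workload as in the original report.
@time results = pmap(svd, M)
```

Running the same script with workers added from a machine file instead of `addprocs(4)` would keep the BLAS setting identical across both launch methods, so any remaining difference is not due to the thread count.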