# Slowdown of a function when executed on worker #6686

**Closed** · carlobaldassi opened this issue on Apr 29, 2014 · 2 comments · Labels: parallel, performance

**carlobaldassi** (The Julia Language member) commented on Apr 29, 2014

I have a case in which I observe a slowdown of a function when it is executed on a worker rather than in the main process. What I mean is: start with `julia -p 1`, then write a function like:

```julia
function inner(args...)
    # here, `args` do not contain distributed arrays,
    # remote refs or the like, just plain Vectors and
    # composite types
    @time begin
        # code
    end
end
```

then call it either from process 1 or from process 2:

```julia
@time remotecall_fetch(1, inner, args...)
@time remotecall_fetch(2, inner, args...)
```

Expected result: the outer timings differ because of data movement, but the inner timings are about the same.

Actual result: the inner timings differ, and can be 2x-4x slower on the worker.

I have a reduced test case with timing results here; this is an excerpt showing the issue:

```
inner on local process
----------------------
elapsed time: 0.589470459 seconds (0 bytes allocated)
elapsed time: 0.589635775 seconds (424 bytes allocated)

inner on worker
---------------
        From worker 2:  elapsed time: 1.615534087 seconds (0 bytes allocated)
        elapsed time: 2.12067934 seconds (38298756 bytes allocated)
```

The code in the gist also shows the only solution I have found so far to recover the lost performance, which was to manually "open up" the BitArrays and fetch their chunks before the tight-loop computations inside the inner function.

The effect may be related to the depth of the nesting in the arguments I'm passing (one of which is a Vector of composite types, each holding a `Vector{BitVector}`). Is it possible that serialization only goes down to a certain depth? When profiling the guts of `inner`, I also saw some weird calls to functions in `multi.jl` occurring when the function was executed by the worker.

**carlobaldassi** added the **performance** and **parallel** labels on Apr 29, 2014

**carlobaldassi** (The Julia Language member) commented on Apr 29, 2014

One additional detail I forgot to mention about the different versions of the inner function in the gist linked above: one might suspect that the performance on the remote worker can be recovered just by manually inlining the code of the `dot` function, but that is not the case: I really need to fetch all the BitVector chunks beforehand, outside of the outermost loop (the sketch below shows the general pattern).

Also, `versioninfo()`:

```
Julia Version 0.3.0-prerelease+2816
Commit d3650a2* (2014-04-29 09:10 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
```
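For reference, the chunk-fetching workaround described above looks roughly like this. This is a minimal sketch, not the code from the gist: the names `bitdot` and `inner_chunks` are illustrative, the thresholds mirror the benchmark in the next comment, and it relies on the internal `chunks` field of `BitVector` (a `Vector{UInt64}` holding the packed bits; spelled `Uint64` on the 0.3 builds above):

```julia
# Dot product of two bit vectors via popcounts on their raw chunks:
# count_ones of the bitwise AND counts the positions set in both.
function bitdot(a::Vector{UInt64}, b::Vector{UInt64})
    d = 0
    for i = 1:length(a)
        d += count_ones(a[i] & b[i])
    end
    return d
end

# Hoist every chunk array out of the tight loop, so the hot code
# only ever touches plain Vector{UInt64} data.
function inner_chunks(Js::Vector{Vector{BitVector}}, X::Vector{BitVector})
    Jchunks = [[j.chunks for j in J] for J in Js]
    Xchunks = [x.chunks for x in X]
    r = 0
    for xc in Xchunks, Jc in Jchunks
        s = 0
        for jc in Jc
            s += bitdot(jc, xc) > 120
        end
        r += s > 50
    end
    return r
end
```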
**malmaud** commented on Oct 14, 2015

I can't reproduce this on master (b6c0d95): the worker is reporting timings identical to the master's. I did decorate `inner` and type `T` with `@everywhere` declarations, which your gist doesn't have. At least with the current way `remotecall_fetch` works, those declarations are necessary; I'm not sure if that's what's making the difference. Anyways, please reopen if this is still an issue.

Here is the revised benchmark I tried:

```julia
addprocs(1)

@everywhere type W
    J::Vector{BitVector}
    W(N::Int, K::Int) = new([bitrand(N) for k = 1:K])
end

function inner(ws::Vector{W}, X::Vector{BitVector})
    r = 0
    @time begin
        for x in X, w in ws
            J = w.J
            s = 0
            for j in J
                s += (dot(j, x) > 120)
            end
            r += s > 50
        end
    end
    return r
end

@everywhere function inner_evenslower(ws::Vector{W}, X::Vector{BitVector})
    r = 0
    @time begin
        Js = [w.J for w in ws]
        for x in X, J in Js
            s = 0
            for j in J
                s += dot(j, x) > 120
            end
            r += s > 50
        end
    end
    return r
end

function outer(ws::Vector{W}, X::Vector{BitVector})
    remotecall_fetch(inner_evenslower, 1, ws, X)
    remotecall_fetch(inner_evenslower, 2, ws, X)
    remotecall_fetch(inner_evenslower, 1, ws, X)
    remotecall_fetch(inner_evenslower, 2, ws, X)
end

function test()
    srand(1)
    N = 500
    K1 = 100
    K2 = 200
    M = 1_000
    ws = [W(N, K1) for i = 1:K2]
    X = [bitrand(N) for i = 1:M]
    outer(ws, X)
end

test()
```

**malmaud** closed this on Oct 14, 2015
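A hypothetical minimal example (not from the issue) of the `@everywhere` requirement mentioned above; it uses the same `remotecall_fetch(f, id, args...)` argument order as the benchmark:

```julia
addprocs(1)

f_local(x) = x + 1                 # defined only on process 1 (the master)
# remotecall_fetch(f_local, 2, 1) # fails: worker 2 has no definition of f_local

@everywhere f_remote(x) = x + 1    # defined on every process
remotecall_fetch(f_remote, 2, 1)   # returns 2
```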