# Slowdown of a function when executed on worker #6686

**Closed** · carlobaldassi opened this issue on Apr 29, 2014 · 2 comments · Labels: parallel, performance

**carlobaldassi** (The Julia Language member) commented on Apr 29, 2014

I have a case in which I observe a slowdown of a function when it is executed on a worker rather than in the main process. What I mean is: start with `julia -p 1`, then write a function like:

```julia
function inner(args...)
    # here, `args` do not contain distributed arrays,
    # remote refs or the like, just plain Vectors and
    # composite types
    @time begin
        # code
    end
end
```

then call it either from process 1 or from process 2:

```julia
@time remotecall_fetch(1, inner, args...)
@time remotecall_fetch(2, inner, args...)
```

Expected result: the outer timings differ because of data movement, but the inner timings are about the same.

Actual result: the inner timings differ, and can be 2x-4x slower on the worker.

I have a reduced test case with timing results here; this is an excerpt showing the issue:

```
inner on local process
----------------------
elapsed time: 0.589470459 seconds (0 bytes allocated)
elapsed time: 0.589635775 seconds (424 bytes allocated)

inner on worker
---------------
        From worker 2:  elapsed time: 1.615534087 seconds (0 bytes allocated)
        elapsed time: 2.12067934 seconds (38298756 bytes allocated)
```

The code in the gist also shows the only solution I have found so far to recover the lost performance, which was to manually "open up" the BitArrays and fetch their chunks before the tight-loop computations inside the inner function.

The effect may be related to the depth of the nesting in the arguments I'm passing (one of which is a Vector of composite types, each holding a `Vector{BitVector}`). Is it possible that serialization only goes down to a certain depth? When profiling the guts of `inner`, I also saw some weird calls to functions in `multi.jl` occurring when the function was executed by the worker.

**carlobaldassi** added the **performance** and **parallel** labels on Apr 29, 2014

**carlobaldassi** (The Julia Language member) commented on Apr 29, 2014

One additional detail I forgot to mention about the different versions of the inner function in the gist linked above: one might suspect that the performance on the remote worker can be recovered just by manually inlining the code of the `dot` function, but that is not the case: I really need to fetch all the BitVector chunks beforehand, outside of the outermost loop (the sketch below shows the general pattern).

Also, `versioninfo()`:

```
Julia Version 0.3.0-prerelease+2816
Commit d3650a2* (2014-04-29 09:10 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
```
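For reference, the chunk-fetching workaround described above looks roughly like this. This is a minimal sketch, not the code from the gist: the names `bitdot` and `inner_chunks` are illustrative, the thresholds mirror the benchmark in the next comment, and it relies on the internal `chunks` field of `BitVector` (a `Vector{UInt64}` holding the packed bits; spelled `Uint64` on the 0.3 builds above):

```julia
# Dot product of two bit vectors via popcounts on their raw chunks:
# count_ones of the bitwise AND counts the positions set in both.
function bitdot(a::Vector{UInt64}, b::Vector{UInt64})
    d = 0
    for i = 1:length(a)
        d += count_ones(a[i] & b[i])
    end
    return d
end

# Hoist every chunk array out of the tight loop, so the hot code
# only ever touches plain Vector{UInt64} data.
function inner_chunks(Js::Vector{Vector{BitVector}}, X::Vector{BitVector})
    Jchunks = [[j.chunks for j in J] for J in Js]
    Xchunks = [x.chunks for x in X]
    r = 0
    for xc in Xchunks, Jc in Jchunks
        s = 0
        for jc in Jc
            s += bitdot(jc, xc) > 120
        end
        r += s > 50
    end
    return r
end
```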
**malmaud** commented on Oct 14, 2015

I can't reproduce this on master (b6c0d95): the worker is reporting timings identical to the master's. I did decorate `inner` and type `T` with `@everywhere` declarations, which your gist doesn't have. At least with the current way `remotecall_fetch` works, those declarations are necessary; I'm not sure if that's what's making the difference. Anyways, please reopen if this is still an issue.

Here is the revised benchmark I tried:

```julia
addprocs(1)

@everywhere type W
    J::Vector{BitVector}
    W(N::Int, K::Int) = new([bitrand(N) for k = 1:K])
end

function inner(ws::Vector{W}, X::Vector{BitVector})
    r = 0
    @time begin
        for x in X, w in ws
            J = w.J
            s = 0
            for j in J
                s += (dot(j, x) > 120)
            end
            r += s > 50
        end
    end
    return r
end

@everywhere function inner_evenslower(ws::Vector{W}, X::Vector{BitVector})
    r = 0
    @time begin
        Js = [w.J for w in ws]
        for x in X, J in Js
            s = 0
            for j in J
                s += dot(j, x) > 120
            end
            r += s > 50
        end
    end
    return r
end

function outer(ws::Vector{W}, X::Vector{BitVector})
    remotecall_fetch(inner_evenslower, 1, ws, X)
    remotecall_fetch(inner_evenslower, 2, ws, X)
    remotecall_fetch(inner_evenslower, 1, ws, X)
    remotecall_fetch(inner_evenslower, 2, ws, X)
end

function test()
    srand(1)
    N = 500
    K1 = 100
    K2 = 200
    M = 1_000
    ws = [W(N, K1) for i = 1:K2]
    X = [bitrand(N) for i = 1:M]
    outer(ws, X)
end

test()
```

**malmaud** closed this on Oct 14, 2015
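A hypothetical minimal example (not from the issue) of the `@everywhere` requirement mentioned above; it uses the same `remotecall_fetch(f, id, args...)` argument order as the benchmark:

```julia
addprocs(1)

f_local(x) = x + 1                 # defined only on process 1 (the master)
# remotecall_fetch(f_local, 2, 1) # fails: worker 2 has no definition of f_local

@everywhere f_remote(x) = x + 1    # defined on every process
remotecall_fetch(f_remote, 2, 1)   # returns 2
```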