Parallel Julia crashes with composite datatype containing SparseMatrixCSC #12848

Closed. cschwarzbach opened this issue on Aug 28, 2015 · 4 comments

Labels: bug, parallel, regression
Participants: @cschwarzbach, @amitmurthy, @mattcbro, @tkelman, @pao, @JeffBezanson

cschwarzbach commented on Aug 28, 2015

Julia crashes when trying to copy a composite datatype containing two copies of a reference to the same SparseMatrixCSC object from the master process to a parallel worker. This problem occurs for Julia 0.4 starting late June 2015. The following code reproduces the error if running in parallel (julia -p N with N >= 1):

    @everywhere begin
        type Two{T}
            A::T
            B::T
        end
    end

    p = workers()[1]

    function send2(A,B)
        x = Two(A,B)
        try
            xref = remotecall_wait(p, identity, x)
            println("Success")
        catch
            println("Failure")
        end
    end

    S = ["dense", "sparse" ]
    A = Any[zeros(0,0), spzeros(0,0)]
    B = Any[zeros(0,0), spzeros(0,0)]

    println()
    for k = 1:2
        println("Testing send2(A, B) with ", S[k], " matrices")
        send2(A[k], B[k])
        println()
        println("Testing send2(A, A) with ", S[k], " matrices")
        send2(A[k], A[k])
        println()
    end

Screen output from Julia 0.3 and old Julia 0.4 (commit 75432c9*, e.g.):

    Testing send2(A, B) with dense matrices
    Success

    Testing send2(A, A) with dense matrices
    Success

    Testing send2(A, B) with sparse matrices
    Success

    Testing send2(A, A) with sparse matrices
    Success

Screen output from recent Julia 0.4 (commit f42b222*, e.g.):

    Testing send2(A, B) with dense matrices
    Success

    Testing send2(A, A) with dense matrices
    Success

    Testing send2(A, B) with sparse matrices
    Success

    Testing send2(A, A) with sparse matrices
    fatal error on 2: ERROR: KeyError: 3 not found
     in handle_deserialize at serialize.jl:455
     in deserialize at serialize.jl:683
     in deserialize_datatype at serialize.jl:636
     in handle_deserialize at serialize.jl:457
     in deserialize at serialize.jl:429
     in anonymous at serialize.jl:472
     in ntuple at ./tuple.jl:32
     in deserialize_tuple at serialize.jl:472
     in handle_deserialize at serialize.jl:450
     in deserialize at serialize.jl:683
     in deserialize_datatype at serialize.jl:636
     in handle_deserialize at serialize.jl:457
     in message_handler_loop at multi.jl:844
     in process_tcp_streams at multi.jl:833
     in anonymous at task.jl:67
    Worker 2 terminated.Failure
    ERROR (unhandled task failure): EOFError: read end of file

The Julia and OS version that I'm using (versioninfo()) is:

    Julia Version 0.4.0-dev+7002
    Commit f42b222* (2015-08-26 20:27 UTC)
    Platform Info:
      System: Darwin (x86_64-apple-darwin14.5.0)
      CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
      LAPACK: libopenblas
      LIBM: libopenlibm
      LLVM: libLLVM-3.3

pao added the bug, parallel, and sparse labels on Aug 28, 2015

amitmurthy (The Julia Language member) commented on Sep 1, 2015

I can confirm that this works fine when fields A and B of Two refer to two different spzeros objects, but fails when they refer to the same object.

May be related to #12079 (comment) - doesn't look like it, ser/deser works fine over a socket connection.

The above comment stands - fails only when the fields of Two refer to the same object.
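Editorial aside: the failure depends on two fields sharing one object, which points at the serializer's bookkeeping for repeated references. The sketch below is a toy model in current Julia (1.x) syntax, not Julia's actual serializer. The names `toy_serialize` and `toy_deserialize` and the `(:object, ...)` / `(:backref, ...)` stream format are hypothetical. It only illustrates how a reader that looks up a back-reference id the writer never registered on its side ends in a KeyError.

```julia
# Toy model (hypothetical, not Base's implementation): the writer replaces a
# repeated object with a numeric back-reference; the reader resolves that
# number through its own table.

function toy_serialize(objs)
    seen = IdDict{Any,Int}()          # object identity -> id assigned by the writer
    stream = Any[]
    for o in objs
        if haskey(seen, o)
            push!(stream, (:backref, seen[o]))
        else
            seen[o] = length(seen) + 1
            push!(stream, (:object, o))
        end
    end
    return stream
end

function toy_deserialize(stream)
    table = Dict{Int,Any}()           # id -> object as reconstructed by the reader
    out = Any[]
    for (tag, payload) in stream
        if tag == :backref
            push!(out, table[payload])    # throws KeyError if the id is unknown
        else
            table[length(table) + 1] = payload
            push!(out, payload)
        end
    end
    return out
end

a = [1.0]
out = toy_deserialize(toy_serialize([a, a]))
println(out[1] === out[2])            # shared reference survives the round trip

# A stream whose back-reference id the reader never registered:
bad = Any[(:object, a), (:backref, 3)]
caught = try
    toy_deserialize(bad)
    nothing
catch err
    err
end
println(typeof(caught))
```

In this toy model the writer and reader must assign ids in exactly the same order. Whether the real regression had this exact shape is an assumption. The thread only establishes that a lookup for id 3 failed during deserialization.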
amitmurthy (The Julia Language member) commented on Sep 1, 2015

Reduced case:

    type Two{T}
        A::T
        B::T
    end

    x=spzeros(0,0)
    X=Two(x,x)

    foo = Base.CallWaitMsg(println, (X,), (1,2), (1,3))

    io=IOBuffer()
    serialize(io, foo)
    seekstart(io)
    deserialize(io)

results in

    ERROR: KeyError: 3 not found
     in handle_deserialize at serialize.jl:455
     in deserialize at serialize.jl:684
     in deserialize_datatype at serialize.jl:637
     in handle_deserialize at serialize.jl:457
     in deserialize at serialize.jl:429
     in anonymous at serialize.jl:472
     in ntuple at ./tuple.jl:32
     in deserialize_tuple at serialize.jl:472
     in handle_deserialize at serialize.jl:450
     in deserialize at serialize.jl:684
     in deserialize_datatype at serialize.jl:637
     in handle_deserialize at serialize.jl:457
     in deserialize at serialize.jl:429
     in deserialize at serialize.jl:426

CallWaitMsg is defined as

    type CallWaitMsg <: AbstractMsg
        f::Function
        args::Tuple
        response_oid::Tuple
        notify_oid::Tuple
    end

@JeffBezanson, do you think 78b999f in the context of SparseMatrix could be the cause?

mattcbro commented on Sep 14, 2015

I'm getting the same kind of error without any sparse matrices at all. In particular I have a @parallel for loop that looks something like:

    Npts = size(deltaq,1)
    pout = SharedArray(Float64,Npts)
    xmitantpos = copy(xmitprams.xmitantpos)
    @sync @parallel for n=1:Npts
        # theoretically this is a copied version of xmitantpos and so I can write to it
        #xmitantpos[qq,:] = deltaq[n,:]
        pout[n] = perfmet(rcons, wgts, xmitprams, xmitprams.xmitantpos)
    end
    # extract data from shared array
    pret = sdata(pout)

So xmitprams is a composite type and there are two references to it in the input arguments. However, even if I copy it, i.e. if the last argument in perfmet uses the copy xmitantpos, I get a similar error, namely:

    julia> fatal error on 3: fatal error on 6: fatal error on 13: fatal error on 7: fatal error on 10: fatal error on fatal error on ERROR: stack overflow
     in deserialize at serialize.jl:357
     in handle_deserialize at serialize.jl:352
     in deserialize at serialize.jl:361

I'll try to get a simpler standalone test case running later. Right now, though, it's a show stopper, since I can't see how to parallelize calls to my function perfmet. The same error occurs if I configure it to use pmap instead. By the way, this is happening on 0.3.11 on a 64-bit Linux OS.

amitmurthy referenced this issue on Sep 15, 2015: fix tracking of serialization state for Function types and Expr #13134 (Merged)

JeffBezanson added the backport pending 0.4 label on Sep 15, 2015

tkelman (The Julia Language member) commented on Sep 16, 2015

given that there's a PR #13134 for this, moving the backport label over there

tkelman removed the backport pending 0.4 label on Sep 16, 2015

JeffBezanson added regression backport pending 0.4 and removed sparse backport pending 0.4 labels on Sep 16, 2015

JeffBezanson added a commit that referenced this issue on Sep 16, 2015: fix serializing functions with cycles, and a bug in serializing Expr (a9ae0ad)

JeffBezanson closed this in #13134 on Sep 16, 2015

JeffBezanson added a commit that referenced this issue on Sep 16, 2015: fix serializing functions with cycles, and a bug in serializing Expr (abba461)
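Editorial aside: to check the same shape of payload on a current Julia, the reduced case from this thread can be approximated as below. This is a sketch under stated assumptions, not the original failing code. It assumes Julia 1.x, where `type` is spelled `struct` and `serialize`/`spzeros` live in the `Serialization` and `SparseArrays` standard libraries. It also replaces the internal `Base.CallWaitMsg` with a plain tuple carrying the same four fields, since that internal type is not relied on here.

```julia
using Serialization, SparseArrays

# Modern spelling of the `type Two{T}` used in the report (assumes Julia 1.x).
struct Two{T}
    A::T
    B::T
end

x = spzeros(0, 0)
X = Two(x, x)                      # both fields refer to the same sparse matrix

# Stand-in for CallWaitMsg(f, args, response_oid, notify_oid): a plain tuple.
msg = (println, (X,), (1, 2), (1, 3))

io = IOBuffer()
serialize(io, msg)
seekstart(io)
f, args, response_oid, notify_oid = deserialize(io)

Y = args[1]
println(typeof(Y))
println(size(Y.A), " ", size(Y.B))
println(Y.A.colptr === Y.B.colptr) # whether the shared reference was preserved
```

If the round trip completes without an error, the message shape that triggered the KeyError in 2015 deserializes cleanly on the version under test. It does not exercise the worker transport used by `remotecall_wait`.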