segfault when using sparse cholesky factorization in parallel #14355

Closed. thraen opened this issue on Dec 10, 2015 · 8 comments

Labels: parallel, sparse
Milestone: 0.4.x
Assignees: none
6 participants: @thraen, @KristofferC, @andreasnoack, @tkelman, @kshyatt, @ViralBShah

@thraen commented on Dec 10, 2015

I encounter segfaults when trying to use sparse Cholesky factorizations in parallel:

    @everywhere const N = 100000
    @everywhere const T = 1000

    const L = speye(N, N)
    const L_chol = cholfact(L)

    @show typeof(L_chol)

    @everywhere function do_slice!(res, L, t)
        b = rand(N)
        res[:, t] = L \ b   # solve with the factorization, store the result in column t
    end

    function do_par_chol()
        res = SharedArray(Float64, (N, T), init = S -> S[localindexes(S)] = 0.0)
        @sync @parallel for t = 1:T
            do_slice!(res, L_chol, t)   # L_chol is sent from the master process to the workers here
        end
    end

    @time do_par_chol()

output:

    typeof(L_chol) = Base.SparseMatrix.CHOLMOD.Factor{Float64}

    signal (11): Segmentation fault
    size at sparse/cholmod.jl:1080
    solve at sparse/cholmod.jl:743
    do_slice! at fact_par_bug.jl:12
    jlcall_do_slice!_21302 at (unknown line)
    jl_apply_generic at /julia/usr/bin/../lib/libjulia.so (unknown line)
    anonymous at fact_par_bug.jl:18
    anonymous at multi.jl:1353
    jl_f_apply at /julia/usr/bin/../lib/libjulia.so (unknown line)
    anonymous at multi.jl:904
    run_work_thunk at multi.jl:645
    run_work_thunk at multi.jl:654
    jlcall_run_work_thunk_21209 at (unknown line)
    jl_apply_generic at /julia/usr/bin/../lib/libjulia.so (unknown line)
    anonymous at task.jl:58
    unknown function (ip: 0x7f1c1c0b6f1c)
    unknown function (ip: (nil))
    Worker 7 terminated.
    ERROR: LoadError: ProcessExitedException()
     in yieldto at ./task.jl:71
     in wait at ./task.jl:371
     in wait at ./task.jl:286
     in wait at ./channels.jl:63
     in take! at ./channels.jl:53
     in take! at ./multi.jl:803
     in remotecall_fetch at multi.jl:729
     in remotecall_fetch at multi.jl:734
     in call_on_owner at multi.jl:777
     in wait at multi.jl:792
     in sync_end at ./task.jl:400
     [inlined code] from task.jl:422
     in do_par_chol at fact_par_bug.jl:16
     in include at ./boot.jl:261
     in include_from_node1 at ./loading.jl:304
    while loading fact_par_bug.jl, in expression starting on line 155
    ERROR (unhandled task failure): readcb: connection reset by peer (ECONNRESET)

versioninfo:

    Julia Version 0.4.3-pre+4
    Commit 926513f (2015-12-07 23:47 UTC)
    Platform Info:
      System: Linux (x86_64-linux-gnu)
      CPU: Intel(R) Xeon(R) CPU E7- 8837 @ 2.67GHz
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
      LAPACK: libopenblas64_
      LIBM: libopenlibm
      LLVM: libLLVM-3.3

@KristofferC commented on Dec 10, 2015

I thought SuiteSparse was not thread safe?

@andreasnoack referenced this issue on Dec 10, 2015: "Check that pointers to SuiteSparse objects haven't been zeroed before calling SuiteSparse routines." #14149 (Merged)

@andreasnoack (The Julia Language member) commented on Dec 10, 2015

See #14149. The segfault has been fixed on master, but the fix hasn't been backported yet. However, it will only change the segfault into an error. You cannot move sparse Cholesky factorizations across workers.

@thraen commented on Dec 10, 2015

Ah, I see. So this probably won't work in the foreseeable future.
Can I work around this by defining the cholfact everywhere instead of moving it around?

@andreasnoack (The Julia Language member) commented on Dec 10, 2015

> So this probably won't work in the foreseeable future.

No. I think it will require that we implement the sparse Cholesky in Julia. It would be a lot of work to match CHOLMOD.

> Can I work around this by defining the cholfact everywhere instead of moving it around?

Yes. That is how you should do this.

@thraen commented on Dec 10, 2015

Thanks!
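
For readers landing here, a minimal sketch of the workaround agreed on above: build the factorization on every process instead of shipping it from the master. It is written in the Julia 0.4 syntax used in this thread and has not been checked against other versions. The reduced sizes and the choice to drop the `L` argument from `do_slice!` are assumptions made for illustration, not something the thread prescribes.

```julia
# Julia 0.4 syntax, as in the original report. Start with workers, e.g. `julia -p 4`.
@everywhere const N = 1000   # assumption: reduced from the report's 100000
@everywhere const T = 100    # assumption: reduced from the report's 1000

# Build the matrix and its factorization on every process, so each worker
# owns a CHOLMOD factor that was created in its own memory space.
@everywhere const L = speye(N, N)
@everywhere const L_chol = cholfact(L)

@everywhere function do_slice!(res, t)
    b = rand(N)
    # L_chol is the process-local global defined above; it is never serialized.
    res[:, t] = L_chol \ b
end

function do_par_chol()
    res = SharedArray(Float64, (N, T), init = S -> S[localindexes(S)] = 0.0)
    @sync @parallel for t = 1:T
        do_slice!(res, t)   # only res and t are sent to the workers
    end
    res
end

@time do_par_chol()
```

With this layout each process pays the factorization cost once, and the loop body no longer captures a `CHOLMOD.Factor` that would have to cross process boundaries.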

@tkelman (The Julia Language member) commented on Dec 10, 2015

Should we close this as fixed (error instead of segfault) on master, or wait until 0.4.3 has the fix backported?

@andreasnoack (The Julia Language member) commented on Dec 10, 2015

A couple of people have hit this lately, so let's keep it open for visibility until 0.4.3 is out.

@kshyatt added the parallel and sparse labels on Dec 10, 2015

@ViralBShah added this to the 0.4.x milestone on Dec 11, 2015

@andreasnoack (The Julia Language member) commented on Jan 29

This is now backported.

@andreasnoack closed this on Jan 29