segfault when using sparse cholesky factorization in parallel #14355

Closed. thraen opened this issue on Dec 10, 2015 · 8 comments

thraen commented on Dec 10, 2015

I encounter segfaults when trying to use sparse cholesky factorizations in parallel:

    @everywhere const N = 100000
    @everywhere const T = 1000

    const L = speye(N, N)
    const L_chol = cholfact(L)
    @show typeof(L_chol)

    @everywhere function do_slice!(res, L, t)
        b = rand(N)
        res[:,T] = L\b
    end

    function do_par_chol()
        res = SharedArray(Float64, (N, T), init = S -> S[localindexes(S)] = 0.0)
        @sync @parallel for t = 1:T
            do_slice!(res, L_chol, t)
        end
    end

    @time do_par_chol()

output:

    typeof(L_chol) = Base.SparseMatrix.CHOLMOD.Factor{Float64}

    signal (11): Segmentation fault
    size at sparse/cholmod.jl:1080
    solve at sparse/cholmod.jl:743
    do_slice! at fact_par_bug.jl:12
    jlcall_do_slice!_21302 at (unknown line)
    jl_apply_generic at julia/usr/bin/../lib/libjulia.so (unknown line)
    anonymous at fact_par_bug.jl:18
    anonymous at multi.jl:1353
    jl_f_apply at julia/usr/bin/../lib/libjulia.so (unknown line)
    anonymous at multi.jl:904
    run_work_thunk at multi.jl:645
    run_work_thunk at multi.jl:654
    jlcall_run_work_thunk_21209 at (unknown line)
    jl_apply_generic at julia/usr/bin/../lib/libjulia.so (unknown line)
    anonymous at task.jl:58
    unknown function (ip: 0x7f1c1c0b6f1c)
    unknown function (ip: (nil))
    Worker 7 terminated.
    ERROR: LoadError: ProcessExitedException
     in yieldto at ./task.jl:71
     in wait at ./task.jl:371
     in wait at ./task.jl:286
     in wait at ./channels.jl:63
     in take! at ./channels.jl:53
     in take! at ./multi.jl:803
     in remotecall_fetch at multi.jl:729
     in remotecall_fetch at multi.jl:734
     in call_on_owner at multi.jl:777
     in wait at multi.jl:792
     in sync_end at ./task.jl:400
     [inlined code] from task.jl:422
     in do_par_chol at fact_par_bug.jl:16
     in include at ./boot.jl:261
     in include_from_node1 at ./loading.jl:304
    while loading fact_par_bug.jl, in expression starting on line 155
    ERROR (unhandled task failure): readcb: connection reset by peer (ECONNRESET)

versioninfo:

    Julia Version 0.4.3-pre+4
    Commit 926513f (2015-12-07 23:47 UTC)
    Platform Info:
      System: Linux (x86_64-linux-gnu)
      CPU: Intel(R) Xeon(R) CPU E7- 8837 @ 2.67GHz
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Nehalem)
      LAPACK: libopenblas64_
      LIBM: libopenlibm
      LLVM: libLLVM-3.3

KristofferC commented on Dec 10, 2015

I thought SuiteSparse was not thread safe?

andreasnoack referenced this issue on Dec 10, 2015: Merged "Check that pointers to SuiteSparse objects haven't been zeroed before calling SuiteSparse routines." #14149

andreasnoack (The Julia Language member) commented on Dec 10, 2015

See #14149. The segfault has been fixed on master, but the fix hasn't been backported yet. However, it will only change the segfault into an error. You cannot move sparse Cholesky factorizations across workers.

thraen commented on Dec 10, 2015

Ah, I see. So this probably won't work in the foreseeable future. Can I work around this by defining the cholfact everywhere instead of moving it around?

andreasnoack (The Julia Language member) commented on Dec 10, 2015

> So this probably won't work in the foreseeable future.

No. I think it will require that we implement the sparse Cholesky in Julia. It would be a lot of work to match CHOLMOD.

> Can I work around this by defining the cholfact everywhere instead of moving it around?

Yes. That is how you should do this.

thraen commented on Dec 10, 2015

Thanks!

tkelman (The Julia Language member) commented on Dec 10, 2015

Should we close this as fixed (error instead of segfault) on master, or wait until 0.4.3 has the fix backported?

andreasnoack (The Julia Language member) commented on Dec 10, 2015

A couple of people have hit this lately, so let's keep it open for visibility until 0.4.3 is out.

kshyatt added the parallel and sparse labels on Dec 10, 2015

ViralBShah added this to the 0.4.x milestone on Dec 11, 2015

andreasnoack (The Julia Language member) commented on Jan 29

This is now backported.

andreasnoack closed this on Jan 29
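
The workaround confirmed in the thread (compute the factorization on every process instead of moving it between workers) might look roughly like the sketch below. It is not code from the issue: it assumes a current Julia 1.x, where `cholfact` is now `cholesky`, `speye(N, N)` is written `sparse(1.0I, N, N)`, and `@parallel` is `@distributed`. The worker count and the reduced sizes of `N` and `T` are illustrative choices, and the column index uses `t` rather than the `T` in the original report.

```julia
using Distributed
addprocs(2)  # assumption: two local workers, adjust for your machine

@everywhere using SparseArrays, LinearAlgebra, SharedArrays

# Smaller than the issue's N = 100000, T = 1000 so the sketch runs quickly.
@everywhere const N = 1_000
@everywhere const T = 8

# Each process builds its own factorization, so the CHOLMOD object it
# wraps is created in (and only used from) that process.
@everywhere const L_chol = cholesky(sparse(1.0I, N, N))

@everywhere function do_slice!(res, t)
    b = rand(N)
    res[:, t] = L_chol \ b   # refers to the process-local L_chol
    return nothing
end

function do_par_chol()
    res = SharedArray{Float64}((N, T))
    # The loop body captures only `res` and `t`; no factorization is
    # serialized and sent to the workers.
    @sync @distributed for t in 1:T
        do_slice!(res, t)
    end
    return res
end

res = do_par_chol()
```

The design point is that `do_slice!` no longer takes the factorization as an argument. In the original report `L_chol` existed only on the master process and was shipped to the workers inside the loop closure, which is the cross-worker move that the thread says is not supported.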