segmentation fault when making many remotecalls, possibly related to shared arrays #16880

Closed · balents opened this issue on Jun 11 · 14 comments · Labels: bug, parallel

@balents commented on Jun 11:

I have a parallel code which seems to function properly for short runs but fails for long ones. It is also allocating a large amount of memory. A skeleton that reproduces the error is:

```julia
const sweeps = 50000
const npersweep = 1000

@everywhere function stuffit!(replica::Int64, sharedS::SharedArray, howmany::Int64)
    nmax = size(sharedS)[1]
    for count in 1:howmany
        i = rand(1:nmax)
        sharedS[i, replica] = rand()
    end
    nothing
end

function runthem!(sharedS::SharedArray, theworkers::Array, howmany::Int64)
    @sync begin
        for (replica, w) in enumerate(theworkers)
            @async remotecall_wait(w, stuffit!, replica, sharedS, howmany)
        end
    end
end

nwork = nworkers()
theworkers = workers()
bigS = SharedArray(Float64, 100, nwork)

@time for steps in 1:sweeps
    runthem!(bigS, theworkers, npersweep)
end
println("done.")
```

Note that all workers access the same SharedArray, but they should always be accessing separate parts of it. The error trace is:

```
$ julia -p3 trytocrash.jl

signal (11): Segmentation fault: 11
__pool_alloc at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gc.c:1056
_new_array_ at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/array.c:84
_new_array at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/array.c:337
julia_notify_21185 at (unknown line)
jlcall_notify_21185 at (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1331
uv_readcb at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
jlcapi_uv_readcb_19085 at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
uv__read at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/libjulia.dylib (unknown line)
uv__stream_io at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/libjulia.dylib (unknown line)
uv__io_poll at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/libjulia.dylib (unknown line)
uv_run at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/libjulia.dylib (unknown line)
process_events at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
wait at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
wait at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
wait_readnb at stream.jl:374
read at stream.jl:926
message_handler_loop at multi.jl:878
process_tcp_streams at multi.jl:867
jlcall_process_tcp_streams_21274 at (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gf.c:1691
anonymous at task.jl:63
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1331
Segmentation fault: 11
```

Version info:

```
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
```

@tkelman added the parallel label on Jun 11

@vtjnash (The Julia Language member) commented on Jun 13:
afaict, this works on master, although I can reproduce the failure on v0.4. It looks like metadata for one of the gc pools got corrupted.

@vtjnash added the backport pending 0.4 and bug labels on Jun 13

@balents commented on Jun 13:

You're right, vtjnash! I confirmed it works on master, and also for my more complicated real code.

@tkelman (The Julia Language member) commented on Jun 14:

It would be ideal if you could attempt a git bisect to find out which change on master was responsible for the fix. Then we can propose a backport and see whether cherry-picking that single change onto release-0.4 fixes it there, or whether it would require multiple other changes.

@balents commented on Jun 14:

I'm afraid this is way beyond me. No idea what a git bisect is. Sorry.

@tkelman (The Julia Language member) commented on Jun 17:

Ah okay, if you're not doing a source build then it's more complicated. We have nightly binaries going back about a month, but if the issue was fixed earlier than that, then someone with a source build who can reproduce this will need to take a closer look.

@kshyatt commented on Jun 17:

I have a source build of 0.4 and I can try to take a look this evening.

@kshyatt self-assigned this on Jun 17

@tkelman (The Julia Language member) commented on Jun 17:

Since this was fixed somewhere on master, it'll need to be a reverse bisect. Recent enough versions of git have the useful option of marking commits with `git bisect new` (bug fixed here) and `git bisect old` (can reproduce bug) to identify when something got fixed.

@kshyatt commented on Jun 18:

Bisect claims ea9baec was the fix.

@yuyichao (The Julia Language member) commented on Jun 18:

That commit shouldn't really fix anything other than compiler warnings...

@yuyichao (The Julia Language member) commented on Jun 18:

I can reproduce this locally; I'll see if I can figure out the actual cause (mostly to check whether we are just hiding the issue on master).

@yuyichao assigned yuyichao and unassigned kshyatt on Jun 18
@yuyichao removed the backport pending 0.4 label on Jun 18
@yuyichao removed their assignment on Jun 18

@yuyichao (The Julia Language member) commented on Jun 18:

OK, I got a very familiar backtrace...
```
Thread 1 hit Breakpoint 1, pool_alloc (p=0x7fd3e1320640) at gc.c:1104
1104        return __pool_alloc(p, p->osize, p->end_offset);
(rr) bt
#0  pool_alloc (p=0x7fd3e1320640) at gc.c:1104
#1  jl_gc_allocobj (sz=sz@entry=48) at gc.c:2297
#2  0x00007fd3dfedd72f in _new_array_ (ndims=1, elsz=, isunboxed=, dims=, atype=0x7fd1da0048b0) at array.c:84
#3  _new_array (ndims=1, dims=, atype=) at array.c:143
#4  jl_alloc_array_1d (atype=, nr=0) at array.c:337
#5  0x00007fd3dbcee268 in julia___notify#32___18938 () at task.jl:296
#6  0x00007fd1d9f1606a in julia_notify_21492 ()
#7  0x00007fd1d9f16090 in jlcall_notify_21492 ()
#8  0x00007fd3dfe7b9eb in jl_apply (nargs=1, args=0x7ffd0f9b8470, f=) at julia.h:1331
#9  jl_apply_generic (F=0x7fd1db9737b0, args=0x7ffd0f9b8470, nargs=) at gf.c:1684
#10 0x00007fd1c85da3a8 in julia_send_del_client_21840 () at multi.jl:582
#11 0x00007fd1c85da019 in julia_finalize_rr_21839 (rr=) at multi.jl:488
#12 0x00007fd3dfe7b9eb in jl_apply (nargs=1, args=0x7ffd0f9b8568, f=) at julia.h:1331
#13 jl_apply_generic (F=0x7fd1dc094750, args=0x7ffd0f9b8568, nargs=) at gf.c:1684
#14 0x00007fd3dfeef9ef in jl_apply (nargs=1, args=0x7ffd0f9b8568, f=0x7fd1dc094750) at julia.h:1331
#15 run_finalizer (o=0x7fd1dcc339b0, ff=0x7fd1dc094750) at gc.c:327
#16 0x00007fd3dfef71c4 in run_finalizers () at gc.c:365
#17 jl_gc_collect (full=full@entry=0) at gc.c:2201
#18 0x00007fd3dfef8027 in __pool_alloc (end_offset=16264, osize=64, p=0x7fd3e1320640) at gc.c:1049
#19 pool_alloc (p=0x7fd3e1320640) at gc.c:1104
#20 jl_gc_allocobj (sz=sz@entry=48) at gc.c:2297
#21 0x00007fd3dfedd72f in _new_array_ (ndims=1, elsz=, isunboxed=, dims=, atype=0x7fd1da0048b0) at array.c:84
#22 _new_array (ndims=1, dims=, atype=) at array.c:143
#23 jl_alloc_array_1d (atype=, nr=0) at array.c:337
#24 0x00007fd3dbba1a3b in julia_copy_6514 (a=) at array.jl:100
#25 0x00007fd3dbcf2f04 in julia_flush_gc_msgs_19134 () at multi.jl:191
#26 0x00007fd1c8608746 in julia_send_msg__21804 (now=) at multi.jl:225
#27 0x00007fd1c860832d in julia_remotecall_wait_21803 (f=) at multi.jl:761
#28 0x00007fd3dfe7b9eb in jl_apply (nargs=5, args=0x7ffd0f9b8b90, f=) at julia.h:1331
#29 jl_apply_generic (F=0x7fd1db7df650, args=0x7ffd0f9b8b90, nargs=) at gf.c:1684
#30 0x00007fd3dfe840ff in jl_apply (nargs=, args=0x7ffd0f9b8b90, f=0x7fd1db7df650) at julia.h:1331
#31 jl_f_apply (F=, args=, nargs=) at builtins.c:491
#32 0x00007fd1c860a103 in julia_remotecall_wait_21802 (id=3 '\003', f=) at multi.jl:768
#33 0x00007fd3dfe7b9eb in jl_apply (nargs=5, args=0x7ffd0f9b8d88, f=) at julia.h:1331
#34 jl_apply_generic (F=0x7fd1db7df650, args=0x7ffd0f9b8d88, nargs=) at gf.c:1684
#35 0x00007fd1c860c0e8 in julia_anonymous_21801 () at task.jl:447
#36 0x00007fd3dfedba04 in jl_apply (nargs=0, args=0x0, f=) at julia.h:1331
#37 start_task () at task.c:247
#38 0x0000000000000000 in ?? ()
```

So this is very similar to #16699 and #16204. In particular, I believe what happens is:

1. The call to flush_gc_msgs copies w.del_msgs.
2. A GC is triggered when allocating the destination array.
3. The source array is grown by finalizers before the memcpy (or, more importantly, before sizeof(::Array)) is called.
4. memcpy is called with the new size of the source array, which is larger than both the old size and the size of the new array.
5. memcpy corrupts memory.

AFAICT the code that triggers this is still there, but unrelated changes may just have made it less likely to be triggered by this particular code.
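To make the interleaving above concrete, here is a minimal Julia sketch. `FakeWorker`, `fake_finalize_rr`, and `fake_flush_gc_msgs` are hypothetical stand-ins for the `Worker` type, `finalize_rr`, and `flush_gc_msgs` in Base's multi.jl, not the actual implementation:

```julia
# Hypothetical stand-ins; this only illustrates the interleaving, it is not the real Base code.
type FakeWorker
    del_msgs::Vector{Any}   # stands in for w.del_msgs in multi.jl
end

const w = FakeWorker(Any[])

# finalize_rr-style finalizer: queues a "delete this remote ref" message.
# Finalizers may run inside *any* allocation that happens to trigger a GC.
fake_finalize_rr(id) = push!(w.del_msgs, id)

function fake_flush_gc_msgs(w::FakeWorker)
    # copy() first allocates a destination array sized from the current length
    # of w.del_msgs.  That allocation can trigger a GC, the GC can run
    # fake_finalize_rr, and the finalizer grows w.del_msgs -- all before the
    # bytes are copied.  If the copy then uses the *new* source length, it
    # writes past the end of the already-sized destination.
    msgs = copy(w.del_msgs)
    empty!(w.del_msgs)
    return msgs   # in Base these would be serialized and sent to the worker
end
```

At the Julia level the sketch only shows the window; the actual out-of-bounds write happens in the C routines underneath `copy`, which size the destination before the GC can run but copy the bytes after finalizers may have grown the source.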
Restating what's mentioned in #16699, I think the right short-term solution is to expose the disable-finalizer API, which will prevent finalizers from being called on the same thread. Thread synchronization will still use a lock (unless there are fancier lock-free algorithms, of course). In the long term, we can consider always running the finalizers on an idle thread so that both can be done with locks. Tentatively adding an assignee, hoping it can be included in #16204 =)

@vtjnash was assigned by yuyichao on Jun 18

@vtjnash (The Julia Language member) commented on Jun 18:

I thought you already committed the interesting bits of #16204 to master? Also, isn't this why you proposed #16893?

@yuyichao (The Julia Language member) commented on Jun 18 (edited):

> I thought you already committed the interesting bits of #16204 to master?

No (edit: well, I guess it depends on what you mean by "interesting"; I certainly committed the parts I'm interested in ;-p). I only applied the C API part and used it in inference, not the part that uses it on Dict or other finalizers.

> Also, isn't this why you proposed #16893?

No, #16893 is for writes that bypass write barriers, not for unexpected recursion due to finalizers.

@yuyichao (The Julia Language member) commented on Jun 18:

> Not the part of using it on Dict or other finalizers.

I can of course try to rebase #16204 after my change. I'm not really familiar with the internals of Dict and the remote stuff, though...

@yuyichao referenced this issue on Jun 19: [WIP] implement BigInt (and BigFloat?) with native Array #17015 (open)

@vtjnash added a commit that referenced this issue on Jul 13: update w.del_msgs array atomically in flush_gc_msgs (306ae42)

@vtjnash referenced this issue on Jul 13: update w.del_msgs array atomically in flush_gc_msgs #17407 (merged)

@vtjnash closed this in #17407 on Jul 14

@mfasi added a commit to mfasi/julia that referenced this issue on Sep 5: update w.del_msgs array atomically in flush_gc_msgs
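The fix that closed this issue, #17407 ("update w.del_msgs array atomically in flush_gc_msgs"), updates the message list atomically. One way to read that pattern is a swap-before-anything-allocates, sketched below using the hypothetical `FakeWorker` stand-in from the earlier sketch (an illustration of the idea, not the actual change in Base):

```julia
# Reuses the FakeWorker stand-in defined in the earlier sketch; hypothetical,
# not the exact code from #17407.
function fake_flush_gc_msgs_fixed(w::FakeWorker)
    # Detach the queue with a plain field swap before any copying or
    # serialization.  After the swap, finalizers push onto the fresh empty
    # array; a finalizer that fires while `Any[]` is being allocated still
    # lands in `msgs` and is simply sent along.  Either way, no buffer is
    # copied while its source is growing underneath it.
    msgs = w.del_msgs
    w.del_msgs = Any[]
    isempty(msgs) && return nothing
    return msgs   # would be serialized and sent to the remote worker
end
```

The design point is that the only operation touching the shared queue is a field read followed by a field write; everything that can allocate, and therefore run finalizers, happens after the queue has been detached.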