segmentation fault when making many remotecalls, possibly related to shared arrays #16880

Status: Closed · opened by balents on Jun 11 · 14 comments
Labels: bug, parallel · Milestone: none · Assignee: vtjnash
Participants: balents, vtjnash, tkelman, kshyatt, yuyichao

**balents commented on Jun 11**

I have a parallel code which seems to function properly for short runs but fails for long ones. It is also allocating a large amount of memory. A skeleton that reproduces the error is:

```julia
const sweeps = 50000
const npersweep = 1000

@everywhere function stuffit!(replica::Int64, sharedS::SharedArray, howmany::Int64)
    nmax = size(sharedS, 1)
    for count in 1:howmany
        i = rand(1:nmax)
        sharedS[i, replica] = rand()
    end
    nothing
end

function runthem!(sharedS::SharedArray, theworkers::Array, howmany::Int64)
    @sync begin
        for (replica, w) in enumerate(theworkers)
            @async remotecall_wait(w, stuffit!, replica, sharedS, howmany)
        end
    end
end

nwork = nworkers()
theworkers = workers()
bigS = SharedArray(Float64, 100, nwork)
@time for steps in 1:sweeps
    runthem!(bigS, theworkers, npersweep)
end
println("done.")
```

Note that all workers access the same SharedArray, but should always be accessing separate parts. The error trace is:

```
$ julia -p3 trytocrash.jl

signal (11): Segmentation fault: 11
__pool_alloc at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gc.c:1056
_new_array_ at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/array.c:84
_new_array at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/array.c:337
julia_notify_21185 at (unknown line)
jlcall_notify_21185 at (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1331
uv_readcb at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
jlcapi_uv_readcb_19085 at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
uv__read at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/libjulia.dylib (unknown line)
uv__stream_io at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/libjulia.dylib (unknown line)
uv__io_poll at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/libjulia.dylib (unknown line)
uv_run at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/libjulia.dylib (unknown line)
process_events at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
wait at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
wait at /Applications/Julia-0.4.5.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
wait_readnb at stream.jl:374
read at stream.jl:926
message_handler_loop at multi.jl:878
process_tcp_streams at multi.jl:867
jlcall_process_tcp_streams_21274 at (unknown line)
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/gf.c:1691
anonymous at task.jl:63
jl_apply at /Users/osx/buildbot/slave/package_osx10_9-x64/build/src/./julia.h:1331
Segmentation fault: 11
```

Version info:

```
Julia Version 0.4.5
Commit 2ac304d (2016-03-18 00:58 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.3
```

tkelman added the parallel label on Jun 11

**vtjnash (The Julia Language member) commented on Jun 13**

AFAICT this works on master, although I can reproduce the failure on v0.4. It looks like metadata for one of the gc pools got corrupted.
vtjnash added the "backport pending 0.4" and "bug" labels on Jun 13

**balents commented on Jun 13**

You're right, vtjnash! I confirmed it works on master, and also for my more complicated real code.

**tkelman (The Julia Language member) commented on Jun 14**

It would be ideal if you could attempt a git bisect to find out which change on master was responsible for the fix. Then we can propose a backport and see whether just cherry-picking that single change to release-0.4 fixes it there, or whether it would require multiple other changes.

**balents commented on Jun 14**

I'm afraid this is way beyond me. No idea what a git bisect is. Sorry.

**tkelman (The Julia Language member) commented on Jun 17**

Ah okay, if you're not doing a source build then it's more complicated. We have nightly binaries going back about a month, but if the issue was fixed earlier than that, then someone with a source build who can reproduce this will need to take a closer look.

**kshyatt commented on Jun 17**

I have a source build of 0.4 and I can try to take a look this evening.

kshyatt self-assigned this on Jun 17

**tkelman (The Julia Language member) commented on Jun 17**

Since this was fixed somewhere on master, it'll need to be a reverse bisect. Recent enough versions of git have the useful option of marking commits with `git bisect new` (bug fixed here) and `git bisect old` (can reproduce bug) to identify when something got fixed.

**kshyatt commented on Jun 18**

Bisect claims ea9baec was the fix.

**yuyichao (The Julia Language member) commented on Jun 18**

That commit shouldn't really fix anything other than compiler warnings...

**yuyichao (The Julia Language member) commented on Jun 18**

I can reproduce this locally. I'll see if I can figure out the actual cause, to check whether we are just hiding the issue on master.

yuyichao assigned yuyichao and unassigned kshyatt on Jun 18
yuyichao removed the "backport pending 0.4" label on Jun 18
yuyichao removed their assignment on Jun 18

**yuyichao (The Julia Language member) commented on Jun 18**

OK, I got a very familiar backtrace....

```
Thread 1 hit Breakpoint 1, pool_alloc (p=0x7fd3e1320640 <norm_pools+96>) at gc.c:1104
1104        return __pool_alloc(p, p->osize, p->end_offset);
(rr) bt
#0  pool_alloc (p=0x7fd3e1320640 <norm_pools+96>) at gc.c:1104
#1  jl_gc_allocobj (sz=sz@entry=48) at gc.c:2297
#2  0x00007fd3dfedd72f in _new_array_ (ndims=1, elsz=<optimized out>, isunboxed=<optimized out>, dims=<synthetic pointer>, atype=0x7fd1da0048b0) at array.c:84
#3  _new_array (ndims=1, dims=<synthetic pointer>, atype=<optimized out>) at array.c:143
#4  jl_alloc_array_1d (atype=<optimized out>, nr=0) at array.c:337
#5  0x00007fd3dbcee268 in julia___notify#32___18938 () at task.jl:296
#6  0x00007fd1d9f1606a in julia_notify_21492 ()
#7  0x00007fd1d9f16090 in jlcall_notify_21492 ()
#8  0x00007fd3dfe7b9eb in jl_apply (nargs=1, args=0x7ffd0f9b8470, f=<optimized out>) at julia.h:1331
#9  jl_apply_generic (F=0x7fd1db9737b0, args=0x7ffd0f9b8470, nargs=<optimized out>) at gf.c:1684
#10 0x00007fd1c85da3a8 in julia_send_del_client_21840 () at multi.jl:582
#11 0x00007fd1c85da019 in julia_finalize_rr_21839 (rr=<error reading variable: DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjunction with DW_OP_piece or DW_OP_bit_piece.>) at multi.jl:488
#12 0x00007fd3dfe7b9eb in jl_apply (nargs=1, args=0x7ffd0f9b8568, f=<optimized out>) at julia.h:1331
#13 jl_apply_generic (F=0x7fd1dc094750, args=0x7ffd0f9b8568, nargs=<optimized out>) at gf.c:1684
#14 0x00007fd3dfeef9ef in jl_apply (nargs=1, args=0x7ffd0f9b8568, f=0x7fd1dc094750) at julia.h:1331
#15 run_finalizer (o=0x7fd1dcc339b0, ff=0x7fd1dc094750) at gc.c:327
#16 0x00007fd3dfef71c4 in run_finalizers () at gc.c:365
#17 jl_gc_collect (full=full@entry=0) at gc.c:2201
#18 0x00007fd3dfef8027 in __pool_alloc (end_offset=16264, osize=64, p=0x7fd3e1320640 <norm_pools+96>) at gc.c:1049
#19 pool_alloc (p=0x7fd3e1320640 <norm_pools+96>) at gc.c:1104
#20 jl_gc_allocobj (sz=sz@entry=48) at gc.c:2297
#21 0x00007fd3dfedd72f in _new_array_ (ndims=1, elsz=<optimized out>, isunboxed=<optimized out>, dims=<synthetic pointer>, atype=0x7fd1da0048b0) at array.c:84
#22 _new_array (ndims=1, dims=<synthetic pointer>, atype=<optimized out>) at array.c:143
#23 jl_alloc_array_1d (atype=<optimized out>, nr=0) at array.c:337
#24 0x00007fd3dbba1a3b in julia_copy_6514 (a=<optimized out>) at array.jl:100
#25 0x00007fd3dbcf2f04 in julia_flush_gc_msgs_19134 () at multi.jl:191
#26 0x00007fd1c8608746 in julia_send_msg__21804 (now=<optimized out>) at multi.jl:225
#27 0x00007fd1c860832d in julia_remotecall_wait_21803 (f=<optimized out>) at multi.jl:761
#28 0x00007fd3dfe7b9eb in jl_apply (nargs=5, args=0x7ffd0f9b8b90, f=<optimized out>) at julia.h:1331
#29 jl_apply_generic (F=0x7fd1db7df650, args=0x7ffd0f9b8b90, nargs=<optimized out>) at gf.c:1684
#30 0x00007fd3dfe840ff in jl_apply (nargs=<optimized out>, args=0x7ffd0f9b8b90, f=0x7fd1db7df650) at julia.h:1331
#31 jl_f_apply (F=<optimized out>, args=<optimized out>, nargs=<optimized out>) at builtins.c:491
#32 0x00007fd1c860a103 in julia_remotecall_wait_21802 (id=3 '\003', f=<optimized out>) at multi.jl:768
#33 0x00007fd3dfe7b9eb in jl_apply (nargs=5, args=0x7ffd0f9b8d88, f=<optimized out>) at julia.h:1331
#34 jl_apply_generic (F=0x7fd1db7df650, args=0x7ffd0f9b8d88, nargs=<optimized out>) at gf.c:1684
#35 0x00007fd1c860c0e8 in julia_anonymous_21801 () at task.jl:447
#36 0x00007fd3dfedba04 in jl_apply (nargs=0, args=0x0, f=<optimized out>) at julia.h:1331
#37 start_task () at task.c:247
#38 0x0000000000000000 in ?? ()
```
So this is very similar to #16699 and #16204. In particular, I believe what happens is:

1. The call to `flush_gc_msgs` copies `w.del_msgs`.
2. A GC is triggered while allocating the destination array.
3. The source array is grown by finalizers before the `memcpy` (or, more importantly, before `sizeof(::Array)`) is called.
4. `memcpy` is called with the new size of the source array, which is larger than both its old size and the size of the new array.
5. `memcpy` corrupts memory.

AFAICT the code that triggers this is still there on master; random changes might just have made it less likely for this particular program to trigger it. Restating what's mentioned in #16699, I think the right short-term solution is to expose the disable-finalizer API, which will prevent finalizers from being called on the same thread. Thread synchronization will still use a lock (unless there are fancier lock-free algorithms, of course). In the long term, we can consider always running the finalizers on an idle thread so that both can be done with locks.

Tentatively adding an assignee, hoping this can be included in #16204.

vtjnash was assigned by yuyichao on Jun 18
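(To make the sequence above concrete, here is a minimal sketch of the hazardous pattern in Julia 0.4-era syntax; `MockWorker`, its field, and the function name are illustrative stand-ins, not the actual `multi.jl` code.)

```julia
# Illustrative mock of the per-worker GC-message bookkeeping (not Base code).
type MockWorker
    del_msgs::Vector{Any}   # object ids queued to be deleted on the remote node
end

# Roughly the shape of the hazardous path seen in frames #24-#25 above:
function flush_gc_msgs_racy!(w::MockWorker)
    # copy() allocates a destination array; that allocation may trigger a GC,
    # and the GC runs finalizers (finalize_rr -> send_del_client in the
    # backtrace) which push! more ids onto w.del_msgs while it is being copied.
    # In the C array copy the destination is sized from the length read before
    # the GC, but the data is copied using the grown length, so the copy can
    # write past the end of the destination buffer.
    msgs = copy(w.del_msgs)
    empty!(w.del_msgs)      # may also drop ids that were appended mid-copy
    return msgs             # ...which would then be serialized to the worker
end
```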
**vtjnash (The Julia Language member) commented on Jun 18**

I thought you already committed the interesting bits of #16204 to master? Also, isn't this why you proposed #16893?

**yuyichao (The Julia Language member) commented on Jun 18 (edited)**

> I thought you already committed the interesting bits of #16204 to master?

No.

Edit: well, I guess it depends on what you mean by "interesting"; I certainly committed the parts I'm interested in ;-p. I only applied the C API part and used it in inference, not the part using it on Dict or other finalizers.

> Also, isn't this why you proposed #16893?

No, #16893 is for `write` that bypasses write barriers, not for unexpected recursion due to finalizers.

**yuyichao (The Julia Language member) commented on Jun 18**

> Not the part of using it on Dict or other finalizers.

I can of course try to rebase #16204 after my change. I'm not really familiar with the internals of Dict and the remote stuff, though...

yuyichao referenced this issue on Jun 19: WIP: implement BigInt (and BigFloat?) with native Array #17015 (open)

vtjnash added a commit that referenced this issue on Jul 13: update w.del_msgs array atomically in flush_gc_msgs (306ae42)

vtjnash referenced this issue on Jul 13: update w.del_msgs array atomically in flush_gc_msgs #17407 (merged)

vtjnash closed this in #17407 on Jul 14

mfasi added a commit to mfasi/julia that referenced this issue on Sep 5: update w.del_msgs array atomically in flush_gc_msgs
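(For later readers: a rough sketch of the swap-before-use idea suggested by the title of #17407. This assumes "atomically" means detaching the array before anything that can allocate; it is an illustration, not a copy of the merged change, and `MockWorker` is again a stand-in type.)

```julia
type MockWorker
    del_msgs::Vector{Any}
end

# Detach the array before any allocation can trigger GC/finalizers: finalizers
# that fire afterwards push! onto the fresh array, so the detached `msgs`
# vector can no longer grow underneath the copy/serialization step.
function flush_gc_msgs_swapped!(w::MockWorker)
    msgs = w.del_msgs
    w.del_msgs = Any[]
    return msgs
end
```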