BoundsError() in flush_gc_msgs #6297

Closed. carlobaldassi opened this issue on Mar 28, 2014 · 9 comments
Labels: bug, parallel
Participants: @carlobaldassi, @amitmurthy, @ihnorton, @jakebolewski, @JeffBezanson

carlobaldassi commented on Mar 28, 2014

I have occasionally seen this error in long-running jobs:

    ERROR: BoundsError()
     in flush_gc_msgs at multi.jl:140
     in send_msg_ at multi.jl:164
     in remotecall_fetch at multi.jl:672
     in sync_end at task.jl:300

Line 140 of multi.jl is:

    msgs = copy(w.del_msgs)

which seems a strange place to throw a BoundsError.

To give context, my code uses SharedArrays and the error comes from within a @sync'd for block with @spawnat, something like:

    @sync for p in ps
        @spawnat p begin
            out[p] = update(p, shrd, args)
        end
    end

where ps is a list of processes, and out and shrd are SharedArrays (shared among all ps).

I wouldn't know how to reproduce it, though. I'm reporting it for the record and just in case someone can guess what's going on. I still have the Julia session where it happened open, if that can be of any use.

Version which was running (with 5 workers):

    Julia Version 0.3.0-prerelease+2077
    Commit 6b9fa29* (2014-03-17 20:45 UTC)
    Platform Info:
      System: Linux (x86_64-linux-gnu)
      CPU: AMD Opteron(tm) Processor 6282 SE
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
      LAPACK: libopenblas
      LIBM: libopenlibm

amitmurthy commented on Mar 28, 2014

If your program is doing addprocs/rmprocs dynamically, the value of p in

    out[p] = update(p, shrd, args)

could extend beyond length(out), hence the BoundsError().

The BoundsError() could be the result of the remotecall_fetch; I have no explanation for the printed stack showing line 140, though.
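The indexing hazard described in this comment can be sketched without any real worker processes. The following is only an illustration under stated assumptions: the worker ids in `ps` are hypothetical, `out` is a plain array rather than a SharedArray, and the loop body stands in for the `update` call, which is not shown in the thread.

```julia
# Hypothetical worker ids; after addprocs/rmprocs they need not be 1:n.
ps = [2, 3, 4, 5, 6]
out = zeros(Int, length(ps))

# Indexing by worker id: the id 6 exceeds length(out) == 5.
threw = try
    for p in ps
        out[p] = p          # stand-in for out[p] = update(p, shrd, args)
    end
    false
catch err
    isa(err, BoundsError)
end

# Indexing by position keeps every write in bounds, whatever the ids are.
out2 = zeros(Int, length(ps))
for (i, p) in enumerate(ps)
    out2[i] = p
end
```

This only shows how an out-of-range worker id would raise a BoundsError on the write to `out`; it does not explain why the reported stack trace points at line 140 of multi.jl.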

carlobaldassi commented on Mar 28, 2014

Unfortunately that's not a possible explanation; I'm not changing the number of processes dynamically.

amitmurthy commented on Mar 28, 2014

If you still have the session open, could you println(ps), println(length(out)) and println(out)?

carlobaldassi commented on Mar 28, 2014

The session is open, but I don't have access to those variables since they were local to a function (I removed that from the backtrace).

amitmurthy commented on Mar 28, 2014

OK. Anyways, just to be safe, your code should probably be changed to

    @sync for (i,p) in enumerate(ps)
        @spawnat p begin
            out[i] = update(p, shrd, args)
        end
    end

No?

carlobaldassi commented on Mar 28, 2014

Yes, that is basically what it actually does. I simplified too much wrt the actual code, sorry.

I also didn't specify that that portion of the code runs thousands of times without issues before crashing.

JeffBezanson added the bug and parallel labels on Apr 2, 2014

carlobaldassi commented on Apr 12, 2014

So this bug is really killing me now. It happens randomly, but given enough time it will show up reliably, and I'm running some long simulations, which means they never reach the end and crash instead.

I have some more data though (12 MB of data, to be precise), but I'm not sure it's useful. If given directions, I could produce something more detailed (give a few days of the simulation running).

For the time being, I put a try...catch block around the call to flush_gc_msgs inside send_msg_ and made it call dump and xdump on all variables:

    try
        flush_gc_msgs(w)
    catch
        open("dump.txt", "w") do f
            println(f, "worker:")
            dump(f, w)
            println(f, "worker (x):")
            xdump(f, w)
            println(f, "kind:")
            dump(f, kind)
            println(f, "kind (x):")
            xdump(f, kind)
            println(f, "args:")
            dump(f, args)
            println(f, "args (x):")
            xdump(f, args)
        end
        rethrow()
    end

The result is collected in this compressed file.

More information: the job was running with 1 master and 25 worker processes (all local). I've seen it happening on 2 different machines. The stack trace was similar to the one reported before; however, I've seen different traces in other cases, all of them ending with the same 2 calls, send_msg_ and flush_gc_msgs:

    ERROR: BoundsError()
     in flush_gc_msgs at multi.jl:140
     in send_msg_ at multi.jl:165
     in remotecall_fetch at multi.jl:690
     in sync_end at task.jl:304
     in ... [etc. (my script)]

versioninfo:

    julia> versioninfo()
    Julia Version 0.3.0-prerelease+2579
    Commit 036c6cc* (2014-04-10 07:17 UTC)
    Platform Info:
      System: Linux (x86_64-linux-gnu)
      CPU: AMD Opteron(tm) Processor 6282 SE
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
      LAPACK: libopenblas
      LIBM: libopenlibm

I'm now restarting, producing a different output file for each worker, and removing sys.so.

vtjnash referenced this issue on Jan 1, 2015: smarter BoundsError reporting #9534 (merged)

ihnorton commented on Mar 25, 2015

Still an issue with the new GC?

jakebolewski commented on Aug 11, 2015

Please reopen if this is still an issue.

jakebolewski closed this on Aug 11, 2015