# BoundsError in flush_gc_msgs (#6297, closed)

Labels: bug, parallel

**carlobaldassi** opened this issue on Mar 28, 2014 · 9 comments

**carlobaldassi** commented on Mar 28, 2014:

I have occasionally seen this error in long-running jobs:

```
ERROR: BoundsError
 in flush_gc_msgs at multi.jl:140
 in send_msg_ at multi.jl:164
 in remotecall_fetch at multi.jl:672
 in sync_end at task.jl:300
```

Line 140 of multi.jl is:

```julia
msgs = copy(w.del_msgs)
```

which seems a strange place to throw a BoundsError.

To give some context: my code uses SharedArrays, and the error comes from within a `@sync`'d for block with `@spawnat`, something like

```julia
@sync for p in ps
    @spawnat p begin
        out[p] = update(p, shrd, args)
    end
end
```

where `ps` is a list of processes and `out` and `shrd` are SharedArrays shared among all `ps`. I wouldn't know how to reproduce it, though. I'm reporting it for the record and in case someone can guess what's going on. I still have the Julia session where it happened open, if that can be of any use.

Version (which was running with 5 workers):

```
Julia Version 0.3.0-prerelease+2077
Commit 6b9fa29 (2014-03-17 20:45 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: AMD Opteron(tm) Processor 6282 SE
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
```

**amitmurthy** commented on Mar 28, 2014:

If your program is doing addprocs/rmprocs dynamically, the value of `p` in `out[p] = update(p, shrd, args)` could extend beyond `length(out)`, hence the BoundsError. The BoundsError could be the result of the remotecall_fetch; that doesn't explain the printed stack showing line 140, though.

**carlobaldassi** commented on Mar 28, 2014:

Unfortunately that's not a possible explanation: I'm not changing the number of processes dynamically.

**amitmurthy** commented on Mar 28, 2014:

If you still have the session open, could you `println(ps)`, `println(length(out))` and `println(out)`?

**carlobaldassi** commented on Mar 28, 2014:

The session is open, but I don't have access to those variables, since they were local to a function (I removed that from the backtrace).

**amitmurthy** commented on Mar 28, 2014:

OK. Anyway, just to be safe, your code should probably be changed to

```julia
@sync for (i, p) in enumerate(ps)
    @spawnat p begin
        out[i] = update(p, shrd, args)
    end
end
```

No?

**carlobaldassi** commented on Mar 28, 2014:

Yes, that is basically what it actually does. I simplified too much with respect to the actual code, sorry. I also didn't mention that that portion of the code runs thousands of times without issues before crashing.

*JeffBezanson added the bug and parallel labels on Apr 2, 2014.*
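For reference, here is a minimal, self-contained sketch of the pattern discussed in the comments above, indexing `out` by loop position as suggested. It uses the Julia 0.3-era API that this thread targets; `update`, `args`, the array sizes and the iteration count are placeholders, not the original code.

```julia
# Minimal sketch (hypothetical names) of the pattern discussed above:
# one SharedArray slot per worker, indexed by loop position rather
# than by process id. Julia 0.3-era syntax.
addprocs(4)

@everywhere update(p, shrd, args) = sum(shrd) + p + args  # placeholder work

ps   = workers()                          # worker process ids
out  = SharedArray(Float64, length(ps))   # one result slot per worker
shrd = SharedArray(Float64, 100)          # shared input data
args = 1.0                                # placeholder extra argument

for iter in 1:1000                        # the block runs many times per job
    @sync for (i, p) in enumerate(ps)
        @spawnat p begin
            out[i] = update(p, shrd, args)
        end
    end
end
```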
**carlobaldassi** commented on Apr 12, 2014:

So this bug is really killing me now: it happens randomly, but given enough time it shows up reliably, and I'm running some long simulations, which means they never reach the end and crash instead. I have some more data (12 MB of data, to be precise), but I'm not sure it's useful. If given directions, I could produce something more detailed (given a few days of the simulation running).

For the time being, I put a try...catch block around the call to flush_gc_msgs inside send_msg_ and made it call dump and xdump on all variables:

```julia
try
    flush_gc_msgs(w)
catch
    open("dump.txt", "w") do f
        println(f, "worker:");     dump(f, w)
        println(f, "worker (x):"); xdump(f, w)
        println(f, "kind:");       dump(f, kind)
        println(f, "kind (x):");   xdump(f, kind)
        println(f, "args:");       dump(f, args)
        println(f, "args (x):");   xdump(f, args)
    end
    rethrow()
end
```

The result is collected in this compressed file. More information: the job was running with 1 master and 25 worker processes (all local), and I've seen it happen on 2 different machines. The stack trace was similar to the one reported before; however, I've seen different traces in other cases, all of them ending with the same 2 calls, send_msg_ and flush_gc_msgs:

```
ERROR: BoundsError
 in flush_gc_msgs at multi.jl:140
 in send_msg_ at multi.jl:165
 in remotecall_fetch at multi.jl:690
 in sync_end at task.jl:304
 in ... (etc.: my script)
```

versioninfo:

```
julia> versioninfo()
Julia Version 0.3.0-prerelease+2579
Commit 036c6cc (2014-04-10 07:17 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: AMD Opteron(tm) Processor 6282 SE
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
```

I'm now restarting, producing a different output file for each worker and removing sys.so.

*vtjnash referenced this issue on Jan 1, 2015: smarter BoundsError reporting (#9534, merged).*

**ihnorton** commented on Mar 25, 2015:

Still an issue with the new GC?

**jakebolewski** commented on Aug 11, 2015:

Please reopen if this is still an issue.

*jakebolewski closed this on Aug 11, 2015.*
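As a reference for the last step described in the Apr 12 comment (writing a different output file for each worker instead of a single dump.txt), here is a sketch of what that instrumentation inside send_msg_ might look like. The file naming and the use of `myid()` and `w.id` are assumptions, not the code that was actually run.

```julia
# Hypothetical variant of the try/catch instrumentation shown above,
# writing a separate dump file per calling process and target worker.
# Julia 0.3-era API; w, kind and args are the locals of send_msg_.
try
    flush_gc_msgs(w)
catch
    open("dump_$(myid())_to_$(w.id).txt", "w") do f
        println(f, "worker:"); dump(f, w)
        println(f, "kind:");   dump(f, kind)
        println(f, "args:");   dump(f, args)
    end
    rethrow()
end
```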