Parallel processes crash when waiting #9497
Closed. davidssmith opened this issue on Dec 30, 2014 · 5 comments · Labels: bug, parallel

@davidssmith commented on Dec 30, 2014:

I haven't a clue how to go about debugging this, but I thought I'd throw this in, because I'm seeing this fairly regularly on two different machines (Mac and Linux).

I have a code that reads and writes a large (~8 MB) array between processes using @spawnat and fetch. The workers perform about 16,384 SVDs on different 16x16 matrices. At some point while waiting, one of the workers (currently there are four) will die with the error

    fatal error on 2: ERROR: write: invalid argument (EINVAL)
     in wait at ./task.jl:284
     in stream_wait at ./stream.jl:263
     in write at stream.jl:788
     in send_msg_ at multi.jl:178
     in send_msg_now at multi.jl:137
     in send_msg_now at multi.jl:83
     in deliver_result at multi.jl:794
     in anonymous at task.jl:856

I have no idea how to go about diagnosing this, because I can't even reproduce it reliably, but I will get you anything you need. Just tell me what to do.

P.S. This might be related to #6629.

    julia> versioninfo()
    Julia Version 0.3.4
    Commit 3392026 (2014-12-26 10:42 UTC)
    Platform Info:
      System: Darwin (x86_64-apple-darwin14.0.0)
      CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
      LAPACK: libopenblas
      LIBM: libopenlibm
      LLVM: libLLVM-3.3

@ihnorton added the bug and parallel labels on Dec 30, 2014.

@ihnorton (The Julia Language member) commented on Dec 30, 2014:

Possibly #6567

@ViralBShah (The Julia Language member) commented on Jan 1, 2015:

Could you post the code snippet?
Cc: @amitmurthy

@davidssmith commented on Jan 2, 2015:

And of course now that you ask I'm having trouble reproducing it. It uses a very large input data set that uses most of the RAM, so could it depend on the system state at the time?

I think this is definitely related to #6567 and maybe #9149. I was able to fix it by reducing the number of calls to @spawnat to just 4 instead of roughly 1024.

If I manage to isolate and reproduce it, I will post the code.

@amitmurthy (The Julia Language member) commented on Apr 23, 2015:

@davidssmith could you test again with the latest master?

@jakebolewski (The Julia Language member) commented on May 29, 2015:

Without a reproducible test case, this is not actionable. @davidssmith please open another issue with a way to reproduce the problem if you still see this behavior on master.

@jakebolewski closed this on May 29, 2015.

@Allardvm referenced this issue on Oct 14, 2015: Remotecall_fetch errors when returning a large array on Windows #13578 (open)
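
For readers landing here: the workaround mentioned in the thread (about 4 calls to @spawnat instead of roughly 1024) amounts to sending each worker one large batch rather than many small jobs. The original code was never posted, so what follows is only a minimal sketch of that batching idea under stated assumptions. It targets current Julia (1.x), where `addprocs`, `@spawnat` and `fetch` live in the `Distributed` standard library and `svdvals` in `LinearAlgebra`; in 0.3.4 these were in Base. The helpers `batch_svdvals`, `chunk_ranges` and `svd_in_batches` and the random input data are hypothetical stand-ins, not anything from the reporter's program.

```julia
using Distributed

# The thread mentions four workers; add them only if none are running yet.
if nprocs() == 1
    addprocs(4)
end

@everywhere using LinearAlgebra

# Hypothetical helper: singular values of every matrix in one batch.
@everywhere batch_svdvals(mats) = [svdvals(m) for m in mats]

# Hypothetical helper: split 1:n into k contiguous, nearly equal ranges.
function chunk_ranges(n, k)
    base, extra = divrem(n, k)
    ranges = UnitRange{Int}[]
    start = 1
    for i in 1:k
        len = base + (i <= extra ? 1 : 0)
        push!(ranges, start:(start + len - 1))
        start += len
    end
    return ranges
end

# Issue one @spawnat per worker instead of one per small job.
function svd_in_batches(mats, ws)
    futures = Future[]
    for (w, r) in zip(ws, chunk_ranges(length(mats), length(ws)))
        chunk = mats[r]   # slice locally so only this chunk is sent to worker w
        push!(futures, @spawnat w batch_svdvals(chunk))
    end
    # Fetch in worker order, so results line up with the input order.
    return reduce(vcat, map(fetch, futures))
end

# Stand-in data (assumption): 16,384 random 16x16 matrices.
mats = [rand(16, 16) for _ in 1:16_384]
results = svd_in_batches(mats, workers())
```

Batching like this reduces the number of messages exchanged between processes. Whether it avoids the EINVAL failure on any given Julia version is not established by this thread; it only mirrors the change the reporter said helped.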