Parallel processes crash when waiting #9497

Closed · davidssmith opened this issue on Dec 30, 2014 · 5 comments

davidssmith commented on Dec 30, 2014

I haven't a clue how to go about debugging this, but I thought I'd throw this in, because I'm seeing this fairly regularly on two different machines (Mac and Linux).

I have a code that reads and writes a large (8 MB) array between processes using @spawnat and fetch. The workers perform about 16,384 SVDs on different 16x16 matrices. At some point while waiting, one of the workers (currently there are four) will die with the error

    fatal error on 2: ERROR: write: invalid argument (EINVAL)
     in wait at ./task.jl:284
     in stream_wait at ./stream.jl:263
     in write at stream.jl:788
     in send_msg_ at multi.jl:178
     in send_msg_now at multi.jl:137
     in send_msg_now at multi.jl:83
     in deliver_result at multi.jl:794
     in anonymous at task.jl:856

I have no idea how to go about diagnosing this, because I can't even reproduce it reliably, but I will get you anything you need. Just tell me what to do.

P.S. This might be related to #6629.

    julia> versioninfo()
    Julia Version 0.3.4
    Commit 3392026 (2014-12-26 10:42 UTC)
    Platform Info:
      System: Darwin (x86_64-apple-darwin14.0.0)
      CPU: Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
      LAPACK: libopenblas
      LIBM: libopenlibm
      LLVM: libLLVM-3.3

ihnorton added the bug and parallel labels on Dec 30, 2014

ihnorton (The Julia Language member) commented on Dec 30, 2014

Possibly #6567

ViralBShah (The Julia Language member) commented on Jan 1, 2015

Could you post the code snippet? Cc: @amitmurthy

davidssmith commented on Jan 2, 2015

And of course now that you ask I'm having trouble reproducing it.
It uses a very large input data set that uses most of the RAM, so could it depend on the system state at the time? I think this is definitely related to #6567 and maybe #9149.

I was able to fix it by reducing the number of calls to @spawnat to just 4 instead of roughly 1024. If I manage to isolate and reproduce it, I will post the code.

amitmurthy (The Julia Language member) commented on Apr 23, 2015

@davidssmith could you test again with the latest master?

jakebolewski (The Julia Language member) commented on May 29, 2015

Without a reproducible test case, this is not actionable. @davidssmith please open another issue with a way to reproduce the problem if you still see this behavior on master.

jakebolewski closed this on May 29, 2015

Allardvm referenced this issue on Oct 14, 2015: Remotecall_fetch errors when returning a large array on Windows #13578 (Open)
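
For readers who land here: the workaround mentioned in the Jan 2, 2015 comment (about 4 @spawnat calls instead of roughly 1024) amounts to batching the work into one chunk per worker. The original code was never posted, so the following is only a rough sketch of that idea, not the reporter's program. It assumes a current Julia 1.x with the `Distributed` standard library rather than the 0.3.4 release from the report; the helper `batch_svdvals` and the random input matrices are hypothetical stand-ins.

```julia
using Distributed

# Assumption: four local workers, matching the count mentioned in the report.
nprocs() == 1 && addprocs(4)
@everywhere using LinearAlgebra

# Hypothetical helper: singular values for a whole batch of small matrices.
@everywhere batch_svdvals(mats) = [svdvals(m) for m in mats]

# Illustrative stand-in data: 16,384 random 16x16 matrices.
mats = [rand(16, 16) for _ in 1:16_384]

# Split the work into one contiguous chunk per worker.
ws = workers()
chunk = cld(length(mats), length(ws))
ranges = [i:min(i + chunk - 1, length(mats)) for i in 1:chunk:length(mats)]
parts = [mats[r] for r in ranges]

# One @spawnat per worker instead of one per small task.
futures = map(zip(ws, parts)) do (w, part)
    @spawnat w batch_svdvals(part)
end

# Collect the per-worker results back into a single vector.
results = reduce(vcat, fetch.(futures))
```

Batching this way sends a handful of large messages rather than many small ones. Whether that avoids the EINVAL failure or only makes it less likely is not established anywhere in this thread.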