worker crashes on pmap with large array on Mac #6567 (Closed)
simonster opened this issue on Apr 18, 2014 · 28 comments
Labels: mac, parallel

simonster (The Julia Language member) commented on Apr 18, 2014

```julia
julia> pmap(Vector{Float64}[zeros(10^9)]) do x end;

julia> @everywhere 1
Worker 2 terminated.
ERROR: stream is closed or unusable
 in check_open at stream.jl:283
```

I have 64GB of memory, so this shouldn't be getting killed by the OOM killer or dying while trying to allocate. After the first command, the main process is consuming 15GB of memory and the worker is consuming 8GB. I've verified with top that the worker doesn't die until the second command. If I don't have any workers, then this works, but it leaks memory corresponding to the size of the array, and the memory can't be reclaimed by gc(). This is not an issue if I use map instead of pmap.

```julia
julia> versioninfo()
Julia Version 0.3.0-prerelease+2652
Commit 8a5a3fc (2014-04-17 17:20 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
```

simonster added the parallel label on Apr 18, 2014

simonster (The Julia Language member) commented on Apr 18, 2014

Could be related to #3934.

JeffBezanson (The Julia Language member) commented on Apr 19, 2014

The array is 8GB and is getting copied to the other process.

simonster (The Julia Language member) commented on Apr 19, 2014

I realize that, but the system has an ample amount of memory. Simply copying an array that big shouldn't cause the worker to crash, especially since the copy seems to succeed and the crash only happens the next time I try to run something on the worker.

simonster (The Julia Language member) commented on Apr 19, 2014

To be clear, I am only claiming that the behavior of pmap with no workers is a memory leak, since there shouldn't be any references to the array after pmap has completed, but the memory doesn't get freed by gc(). I don't actually know if there is a leak when there is a worker, since @everywhere gc() causes the worker to terminate unexpectedly.

amitmurthy (The Julia Language member) commented on Apr 20, 2014

My machine has 16GB of RAM and this is what I noticed.

```julia
for i in 1:1000
    pmap(Vector{Float64}[zeros(10^8)]) do x end;
    @everywhere gc()
    @everywhere 1
    println(i)
end
```

works fine. The processes do take up a large amount of memory, but there is no leak. Just a single instance of 3 times the array size causes the problem you detailed, i.e.

```julia
pmap(Vector{Float64}[zeros(3*10^8)]) do x end;
@everywhere gc()
```

results in worker 2 exiting.

My explanation for your use case is as follows.

On the main process:
- Array to be sent: 8GB
- Serialized array in an IOBuffer: 8GB
- takebuf_array before writing to the socket: 8GB
- Deserialized response (the whole array is being returned): 8GB

On the worker:
- Deserialized array: 8GB
- Serialized array: 8GB
- takebuf_array before writing to the socket: 8GB

That explains 56GB that is not gc'ed during the call - it is all in scope one way or the other. I think it really is a case of the system running out of memory. A cursory glance shows that we do not check the return value of malloc during memory allocation. Attaching gdb to the worker shows it exiting due to a bad socket, which I suspect is due to memory corruption caused by the above situation.
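For illustration, the send path behind that accounting looks roughly like the sketch below, written against the 0.3-era API (serialize, takebuf_array). The function and socket names are made up for the example; this is not the actual multi.jl code.

```julia
# Sketch of the request-send path described in the accounting above.
# Assumes the Julia 0.3-era API (serialize, takebuf_array); names are illustrative only.
function send_request(sock::IO, arr::Vector{Float64})
    iob = IOBuffer()
    serialize(iob, arr)        # the serialized form of the 8GB array, held by the IOBuffer
    buf = takebuf_array(iob)   # the serialized bytes pulled out as a byte vector
    write(sock, buf)           # arr, the IOBuffer and buf are all still in scope here,
                               # which is how the per-process totals above add up
end
```

The intermediate IOBuffer in this sketch is the send buffer that the next paragraph proposes eliminating.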
One optimization I can think of is to see if we can avoid the intermediate send buffer (the IOBuffer), and possibly call gc() after a takebuf_array if the array in question is large enough.

Note that on my system, if I try with 3*10^8 floats, i.e. the failing case, it fails as soon as the main RAM is exhausted, irrespective of the swap space available.

simonster (The Julia Language member) commented on Apr 20, 2014

The function returns nothing, not the input array, so I think you have doubled the memory use in your calculation above. The main process should use 16GB of memory and the worker should use 8GB of memory, which matches what top says above. In any case, on the same system with 64GB of RAM, I can reproduce the bug when sending only 2GB of data:

```julia
julia> pmap(Vector{Float64}[zeros(2^28)]) do x end;

julia> @everywhere 1
Worker 2 terminated.
ERROR: stream is closed or unusable
 in check_open at stream.jl:283
```

I'm suspicious that there is silent truncation to int or Int32 happening somewhere; 2^28-4096 works fine.

simonster (The Julia Language member) commented on Apr 20, 2014

It additionally looks as though something is holding a reference to the array passed to pmap even after pmap has returned. With no worker, this results in the julia process continuing to hold on to 20GB of memory:

```julia
julia> pmap([zeros(2^28-8192) for i = 1:10]) do x end
10-element Array{Any,1}:
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing
 nothing

julia> gc()
```

while in this case, all memory gets freed:

```julia
julia> a = [zeros(2^28-8192) for i = 1:10]; pmap(a) do x end; empty!(a)
0-element Array{Array{Float64,1},1}

julia> gc()
```

JeffBezanson (The Julia Language member) commented on Apr 20, 2014

@amitmurthy can you point me to a place where we don't check the return value of malloc? All the functions in gc.c do, but on some platforms malloc never returns NULL. If you look for the commented-out jl_gc_lookfor in gc.c, you can use that trick in GDB to find the reference to the value (you also have to add an assert v != lookforme in push_root). When the assertion fails, the backtrace will give you a chain of values pointing to the array (hopefully).

amitmurthy (The Julia Language member) commented on Apr 21, 2014

Sorry Jeff, it was late at night and my mind was probably seeing stuff that my eyes weren't. The mallocs are OK. And I meant strace, not gdb. Anyway, here is the problematic part of the strace output:

```
....
read(16, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(16, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(16, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
read(16, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 6269
epoll_wait(5, {}, 1024, 0) = 0
write(16, "\2\6result\24\2\177\200", 13) = 13
epoll_ctl(5, EPOLL_CTL_MOD, 16, {EPOLLIN, {u32=16, u64=16}}) = 0
epoll_wait(5, {}, 1024, 0) = 0
epoll_wait(5, {}, 1024, 0) = 0
epoll_wait(5, {{EPOLLIN, {u32=16, u64=16}}}, 1024, -1) = 1
read(16, 0x7f6aeb1c6010, 18446744071814643712) = -1 EFAULT (Bad address)
....
```

This happens while sending 3*10^8 floats over. The total number of bytes read for the first call is 36621*65536 + 6269 = 2400000125, which is correct. The result has also been sent back cleanly. The last read call above, which fails, is for the first message sent in the context of @everywhere gc(). The deserialize in multi.jl is requesting a single byte which identifies the message type. As you can see, the length requested in the read call, which should be 65536 under normal circumstances, is totally chewed up. Will poke around a bit more.
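As the next few comments work out, that bogus length is what a byte count of roughly 2.4 GB becomes after being truncated to a signed 32-bit value and then widened back to size_t. A minimal illustration, in current Julia syntax; the starting value is simply the observed 64-bit number reduced mod 2^32, so treating it as the exact length the Julia side computed is an assumption:

```julia
# Illustration only: how a ~2.4 GB count maps to the length seen in the failing read(2).
len      = 2_400_059_392                         # assumed: 18446744071814643712 mod 2^32
as_int32 = len % Int32                           # wraps around to -1894907904
as_sizet = reinterpret(UInt64, Int64(as_int32))  # sign-extends to 18446744071814643712
```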
amitmurthy (The Julia Language member) commented on Apr 21, 2014

OK. On gdb now.

```
Breakpoint 3, jl_uv_alloc_buf (handle=0x1ab0270, suggested_size=65536, buf=0x7fffaae8e4c0) at jl_uv.c:183
183
Breakpoint 2, jl_uv_readcb (handle=0x1ab0270, nread=-14, buf=0x7fffaae8e4c0) at jl_uv.c:177
177
(gdb) p buf->len
$2 = 18446744071814631040
```

Actually, after this line, buf->len = jl_unbox_int32(jl_t1(ret)); in jl_uv.c:189, the value of buf->len is way more than an int32. How could that be possible? Any ideas on what I should be looking for?

Keno (The Julia Language member) commented on Apr 21, 2014

Seems like buf->len is a size_t, so maybe we need to make the return value conditional on 32-bit vs. 64-bit.

tknopp commented on Apr 21, 2014

Doesn't this mean that jl_unbox_int32(jl_t1(ret)) is negative, and assigned to a size_t it gets a large value of about 2^64?

Keno (The Julia Language member) commented on Apr 21, 2014

Correct, which probably happened because the calculation overflowed on the Julia side. On 64-bit, we really need to pass through 64-bit integers.

simonster changed the title from "pmap crash memory leak with large arrays" to "pmap crash with large arrays" on Apr 21, 2014
simonster referenced this issue on Apr 21, 2014: task-related memory leaks #6597 (Open)
Keno added a commit that referenced this issue on Apr 22, 2014: Fix int32 chokepoint in jl_uv.c for #6567 (ececf77)

Keno (The Julia Language member) commented on Apr 22, 2014

Ok, I fixed the libuv thing, which should have hopefully fixed the crash. Can you confirm?

amitmurthy (The Julia Language member) commented on Apr 22, 2014

I can confirm that

```julia
pmap(Vector{Float64}[zeros(3*10^8)]) do x end;
@everywhere gc()
```

which used to crash on my machine, works fine now. @simonster, could you confirm on your setup?

simonster (The Julia Language member) commented on Apr 22, 2014

Unfortunately I have a long-running job that's using most of my memory right now, and I'm afraid of killing it, but I'll try tomorrow. It still does not work on my MacBook Pro, but that may have a different cause, as the worker there dies immediately instead of after the next command:

```julia
julia> pmap(Vector{Float64}[zeros(2^28)]) do x end;
Worker 2 terminated.
```

It's still a 32-bit issue, since 2^28-8192 works. Perhaps related to #6400?

simonster added the mac label on Apr 22, 2014

simonster (The Julia Language member) commented on Apr 22, 2014

Yes, this now works on my Linux system. Leaving open for the Mac issue.

simonster changed the title from "pmap crash with large arrays" to "worker crashes on pmap with large array on Mac" on Apr 22, 2014

amitmurthy (The Julia Language member) commented on Apr 23, 2014

What is the amount of RAM on the MacBook?

simonster (The Julia Language member) commented on Apr 23, 2014

16GB

amitmurthy (The Julia Language member) commented on Apr 23, 2014

OK. The reason I asked is because while 2^29 fails on my 16GB Linux machine, 2^29-8192 works. This is because in the former case, the send buffer and the recv buffer on the worker, which hold the serialized/deserialized request, are grown to 8GB instead of the little more than 4GB that is actually required.
strace on the worker displays the following:

```
read(16, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 65536) = 65536
mremap(0x7f933fe1f000, 4294971392, 8589938688, MREMAP_MAYMOVE) = -1 EFAULT (Bad address)
mmap(NULL, 8589938688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x204bae000) = 0x4bb6000
mmap(NULL, 8590069760, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
```

So, in effect I need 4GB (initial array) + 8GB (send buffer) + 8GB (recv buffer), and I see the above errors. I think the code responsible for this is jl_array_grow_end in array.c:

```c
void jl_array_grow_end(jl_array_t *a, size_t inc)
{
    if (a->isshared && a->how != 3)
        jl_error("cannot resize array with shared data");
    // optimized for the case of only growing and shrinking at the end
    size_t alen = jl_array_nrows(a);
    if (alen + inc > a->maxsize - a->offset) {
        size_t newlen = a->maxsize == 0 ? (inc < 4 ? 4 : inc) : a->maxsize * 2;
        while (alen + inc > newlen - a->offset)
            newlen *= 2;
        array_resize_buffer(a, newlen, alen, a->offset);
    }
#ifdef STORE_ARRAY_LEN
    a->length += inc;
#endif
    a->nrows += inc;
}
```

At least in my case, the newlen*2 results in double the memory actually required. In your case, with 2^28, the RAM required will be 2 + 4 + 4 GB, so unless the system had other processes taking up large amounts of RAM, it may be a different issue.

amitmurthy (The Julia Language member) commented on Apr 23, 2014

OK, I tried after changing jl_array_grow_end to only allocate a maximum of 1MB more than requested. Doing that, the 2^29 case followed by @everywhere gc() works fine once. A second run results in a crash.

```
Process 1790 attached
epoll_wait(5, {{EPOLLIN, {u32=16, u64=16}}}, 1024, -1) = 1
read(16, "\2\ncall_fetch\24\2\177\204\23\1 \ \ny\273!\320\322\365\n'\26\3:"..., 4295032284) = 329
epoll_wait(5, {}, 1024, 0) = 0
mmap(NULL, 4294971392, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
brk(0x1040c7000) = 0x43c7000
mmap(NULL, 4295102464, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
```

I wonder if

```c
uint16_t elsize;
uint32_t offset;  // for 1-d only. does not need to get big.
```

in jl_array_t (julia.h) is a probable cause.

JeffBezanson (The Julia Language member) commented on Apr 23, 2014

Limiting it is good, but 1MB is probably too small. We should set these thresholds at startup based on the system memory size.
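One hypothetical shape for such a capped growth policy is sketched below. This is neither the change amitmurthy tried nor anything that was committed; `threshold` stands in for the kind of value Jeff suggests deriving from system memory at startup, and `slack` for the 1MB figure mentioned above.

```julia
# Hypothetical capped growth policy (sketch only): double small buffers as before,
# but once past `threshold`, grow only to the requested size plus a bounded `slack`.
function grown_length(maxsize::Integer, needed::Integer; threshold = 64 * 1024^2, slack = 1024^2)
    newlen = max(maxsize, 4)
    while newlen < needed
        newlen = newlen < threshold ? 2 * newlen : needed + slack
    end
    return newlen
end
```

With a buffer already at roughly 4GB and a request only a few bytes larger, this returns about the request plus 1MB instead of doubling to 8GB.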
simonster (The Julia Language member) commented on Apr 23, 2014

The crash on the second run could be caused by #6597. Did you check whether the processes were still holding onto memory after the first pmap?

amitmurthy (The Julia Language member) commented on Apr 24, 2014

The processes were still holding onto memory, especially the worker. I added a couple of debug statements just before and after the blocking call to deserialize the request type. Output below; all the fields are from sock.buffer and edited to only show those from worker 2.

```
julia> pmap(Vector{Float64}[zeros(2^29)]) do x end;
From worker 2: Recd request call_fetch
pid 2, length(data) 65536, readable true, writable true, seekable false, append true, size 43776, maxsize 9223372036854775807, ptr 13
From worker 2: Waiting for request...
pid 2, length(data) 4295011072, readable true, writable true, seekable false, append true, size 4294967421, maxsize 9223372036854775807, ptr 4294967422

julia> @everywhere gc()
From worker 2: Recd request call
pid 2, length(data) 4295011072, readable true, writable true, seekable false, append true, size 204, maxsize 9223372036854775807, ptr 2
From worker 2: Waiting for request...
pid 2, length(data) 4295011072, readable true, writable true, seekable false, append true, size 204, maxsize 9223372036854775807, ptr 205
From worker 2: Recd request call_fetch
pid 2, length(data) 4295011072, readable true, writable true, seekable false, append true, size 34, maxsize 9223372036854775807, ptr 13
From worker 2: Waiting for request...
pid 2, length(data) 4295011072, readable true, writable true, seekable false, append true, size 34, maxsize 9223372036854775807, ptr 35

julia> pmap(Vector{Float64}[zeros(2^29)]) do x end;
From worker 2: Recd request do
pid 2, length(data) 4295011072, readable true, writable true, seekable false, append true, size 28, maxsize 9223372036854775807, ptr 5
From worker 2: Waiting for request...
pid 2, length(data) 4295011072, readable true, writable true, seekable false, append true, size 28, maxsize 9223372036854775807, ptr 29

julia> @everywhere gc()
fatal error on 2: ERROR: MemoryError
 in read at io.jl:118
 in deserialize at serialize.jl:410
....
From worker 2: Recd request call_fetch
pid 2, length(data) 4295011072, readable true, writable true, seekable false, append true, size 329, maxsize 9223372036854775807, ptr 13
Worker 2 terminated.
```

As can be seen, the socket buffer is never resized downwards, for example during the gc() calls. This just results in the worker locking up memory equivalent to the largest buffer it ever received.

amitmurthy (The Julia Language member) commented on Apr 27, 2014

I am seeing the main process segfault pretty reliably with multiple runs of 2^28. The gdb backtrace is:

```
(gdb) backtrace
#0  uv_write2 (req=0x0, stream=0x3165b10, bufs=0x7fff15b036e0, nbufs=1, send_handle=0x0, cb=0x7f0927846848 <jl_uv_writecb_task>) at src/unix/stream.c:1309
#1  0x00007f0927846de8 in jl_write_no_copy (stream=0x3165b10, data=0x7f06fbffd010 "\002\ncall_fetch\024\002\177\204\023\001"..., n=2147483773, uvw=0x0, writecb=0x7f0927846848 <jl_uv_writecb_task>) at jl_uv.c:607
#2  0x00007f0925a817de in ?? ()
#3  0x000000000000000a in ?? ()
#4  0x000000000000000c in ?? ()
#5  0x00007fff15b037b8 in ?? ()
#6  0x0000000000000000 in ?? ()
```

@loladiro, anything in particular you want me to look at?

amitmurthy (The Julia Language member) commented on Apr 27, 2014

Tried changing uint32_t offset; to size_t offset; in jl_array_t (julia.h), followed by a make clean; make. Doesn't work; it looks like the sizeof of jl_array_t is specified in some other form elsewhere. What do I need to do to try it out?

Keno (The Julia Language member) commented on Apr 27, 2014

Ah, got it. There are calls to malloc in jl_uv. They should probably either be checked or go through the managed malloc.
This was referenced on Nov 25, 2014:
- Memory Error when creating large number of workers using DArray #9149 (Closed)
- SharedArray not being garbage collected #9348 (Closed)
- references to interpolated values persist in modules #9375 (Closed)
- Parallel processes crash when waiting #9497 (Closed)

amitmurthy (The Julia Language member) commented on Jun 14, 2015

@simonster, I believe this is not a problem anymore. Please reopen if you still see this.

amitmurthy closed this on Jun 14, 2015

Allardvm referenced this issue on Oct 14, 2015: Remotecall_fetch errors when returning a large array on Windows #13578 (Open)