Race condition when updating cache #14118

Closed · eschnett opened this issue on Nov 24, 2015 · 5 comments
Labels: parallel, precompile

eschnett commented on Nov 24, 2015

I am running Julia in parallel, i.e. I am starting several Julia interpreters simultaneously. Usually this works fine. When a package is outdated, both will recompile it. However, I just encountered this error. It went away when I tried again, so I assume it is a race condition. This is with the release branch of Julia 0.4.

    $ mpirun -np 2 ~/julia/bin/julia 06-cman-transport.jl MPI
    INFO: Recompiling stale cache file /Users/eschnett/.julia/lib/v0.4/MPI.ji for module MPI.
    INFO: Recompiling stale cache file /Users/eschnett/.julia/lib/v0.4/MPI.ji for module MPI.
    ERROR: LoadError: unlink: no such file or directory (ENOENT)
     in unlink at fs.jl:102
     in rm at file.jl:59
     in create_expr_cache at loading.jl:330
     in recompile_stale at loading.jl:461
     in _require_from_serialized at loading.jl:83
     in _require_from_serialized at /Users/eschnett/julia/lib/julia/sys.dylib
     in require at /Users/eschnett/julia/lib/julia/sys.dylib
     in include at /Users/eschnett/julia/lib/julia/sys.dylib
     in include_from_node1 at /Users/eschnett/julia/lib/julia/sys.dylib
     in process_options at /Users/eschnett/julia/lib/julia/sys.dylib
     in _start at /Users/eschnett/julia/lib/julia/sys.dylib
    while loading /Users/eschnett/.julia/v0.4/MPI/examples/06-cman-transport.jl, in expression starting on line 1
    -------------------------------------------------------
    Primary job terminated normally, but 1 process returned
    a non-zero exit code.. Per user-direction, the job has been aborted.
    -------------------------------------------------------

The standard way to handle this in Unix is to write to a temporary file ($output.tmp), and then use an atomic rename to update the cache (mv("$output.tmp", output)). Removing the file beforehand is not necessary, and can't be done safely unless one locks the directory.

ViralBShah (The Julia Language member) commented on Nov 24, 2015

I was expecting this one to strike - but not so soon!

ViralBShah added the precompile label on Nov 24, 2015
kshyatt added the parallel label on Nov 24, 2015

tkelman (The Julia Language member) commented on Nov 24, 2015

Essentially a duplicate of #13684. Force a precompile manually if you want to use a package in parallel. We already are creating .ji files atomically, ref #12699. Also ref #12723 which doesn't have any great way of knowing whether other instances of Julia happen to be in the process of creating the same file.

StefanKarpinski (The Julia Language member) commented on Nov 24, 2015

Can we use discretionary file locking?

eschnett commented on Nov 25, 2015

#13684 is different -- that is about machine-specific caches, whereas this issue here is about an access conflict.

I'm not sure what #12699 addresses. The issue here is caused by files not being generated atomically; maybe the solution to #12699 decayed over time?

At the moment, this happens sequentially:

1. check whether a file exists
2. if so, delete it
3. open the file, truncating it
4. write to the file

The error occurs if the first two actions overlap, since the file can't be deleted twice. The other race -- two processes writing to the same file -- could lead to silent corruption.

I'm working on a solution in #14143.

eschnett referenced this issue on Nov 25, 2015: Merged "Avoid race condition when removing cache file" #14145

tkelman commented on Nov 29, 2015

closed by #14145

tkelman closed this on Nov 29, 2015
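
For readers landing here later: the write-to-a-temporary-file-then-rename pattern suggested in the opening comment, and the "can't be deleted twice" failure described on Nov 25, can be sketched roughly as below. This is only an illustration, not the code in loading.jl and not the change merged in #14145. It assumes a recent Julia 1.x rather than 0.4. The names `write_cache_atomically` and `remove_if_present` are hypothetical. `Base.Filesystem.rename` is an internal, unexported function and is assumed here to map onto the operating system's rename, which replaces an existing destination in one step when both paths are on the same filesystem.

```julia
# Hypothetical helper: tolerate another process having deleted the file first.
# With force=true, rm does not treat a missing path as an error.
remove_if_present(path::AbstractString) = rm(path; force=true)

# Hypothetical helper: publish `data` at `output` without ever exposing a
# partially written file and without deleting `output` beforehand.
function write_cache_atomically(output::AbstractString, data::AbstractString)
    # Create the temporary file next to the destination so that the rename
    # does not cross a filesystem boundary.
    tmp, io = mktemp(dirname(abspath(output)))
    try
        try
            write(io, data)
        finally
            close(io)
        end
        # Assumption: internal wrapper around the OS rename (see note above).
        Base.Filesystem.rename(tmp, output)
    catch
        # Do not leave the temporary file behind if anything failed.
        rm(tmp; force=true)
        rethrow()
    end
    return output
end

# Small demonstration in a scratch directory.
mktempdir() do dir
    output = joinpath(dir, "MPI.ji")
    write_cache_atomically(output, "first")
    write_cache_atomically(output, "second")  # replaces the file, no prior rm
    @assert read(output, String) == "second"
    @assert readdir(dir) == ["MPI.ji"]         # no temporary file remains
    remove_if_present(output)
    remove_if_present(output)                  # second removal is a no-op
    @assert !isfile(output)
end
```

Each process writes its own uniquely named temporary file, so two simultaneous recompiles never write into the same file; whichever rename happens last wins. The exported `mv(src, dst; force=true)` is deliberately not used in this sketch because, in the Julia versions I am aware of, it removes an existing destination before renaming, which reintroduces the window this issue is about.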