julia-users › Parallel read and preprocess variable-length data files
Joshua Jones · May 31

I'm looking for pointers on best practice for reading and preprocessing several files of variable-length time-series data in parallel. I have little familiarity with parallelization, so it feels like I'm missing something obvious. I've read the Julia documentation and tried several approaches based on it, but with little success. All pids are on one machine.

Each file consists of a fixed-length header + NX points of variable-precision data (NX varies from file to file). NX is stored as a value in each fixed-length header.

I've come up with two clumsy solutions; the problems are described below. Note that, with no parallelization, execution time is 400% of numpy's, mostly due to file read speed.

1. pmap the file read + preprocess. Somewhat slow due to the overhead of moving data between processes; slightly slower than numpy. (A stripped-down sketch is appended at the end of this post.)

2. Shared array: parallelize the read + downsample. I can achieve a 20% speedup vs. numpy, but I've only found two ways to make this work.

   a. A two-pass approach: open each file twice, in a pair of pmap-style @sync loops. (Also sketched at the end of this post.)
      - First pass, loop over each file: open, get NX, get the timestamp, close, return.
      - NY = round(Int, sum(NX) / fs_ratio); xx = SharedArray(Float64, NY); set variables for e.g. time indexing.
      - Second pass, loop over each file: open, seek to the data start, read the data, downsample into the shared array, close.

   b. A single-pass approach with stat(): estimate the shared array size by summing (stat(file).size - header_size) / smallest_precision over each file, initialize a shared array of NaNs, loop over each file to read the data, then delete the leftover NaNs. This runs into memory problems: the lowest precision of some data formats overestimates NX by 8x (e.g. Int4 vs. Int32).

It seems like the fastest approach by far would be to leave the IOStreams open (two-pass approach), or to read and preprocess each file into its own SharedArray (one-pass approach).

With the former, I don't know how to pass IOStreams between workers. Is this even possible? I've flailed with approaches like an Array{IOStream,1} filled with values created in my parallel kernel, but I get "Bad file descriptor" errors when I try to read from any of the resultant IOStreams again. For any such stream, isopen(stream) returns true and isreadable(stream) returns true, but the handle is set to 0x00...000. I don't know a workaround, or whether one exists.

With the latter, if the data are stored in a SharedArray created in my parallel kernel, accessing it from myid() with e.g. sdata(s) gives #undef for every value of s. What I've read suggests this is by design (the same basic problem as https://github.com/JuliaLang/julia/issues/13802).

What am I doing wrong here? Thanks in advance for any suggestions you can offer.

Aside, mostly: is there a documented/accepted fastest way to read a signed two's-complement 4-bit integer? (The baseline I have in mind is the last sketch below.)
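
Here is the stripped-down pmap version (Julia 0.4-era syntax; run after addprocs). The header_size / nx_offset / fs_ratio defaults and the plain decimation are placeholders for my real format constants and filter:

    # Approach 1: each worker reads + downsamples one file and returns the
    # result to the master. Placeholders: header_size, nx_offset, fs_ratio.
    @everywhere function load_one(fname; header_size=512, nx_offset=64, fs_ratio=10)
        io = open(fname, "r")
        seek(io, nx_offset)
        nx = Int(read(io, Int32))          # NX lives in the fixed-length header
        seek(io, header_size)
        d = read(io, Float32, nx)          # real code branches on sample precision
        close(io)
        # placeholder "downsample": simple decimation
        return Float64[d[(k - 1) * fs_ratio + 1] for k = 1:div(nx, fs_ratio)]
    end

    # usage, where files is my Vector of filenames:
    # xx = vcat(pmap(load_one, files)...)

I assume most of the overhead here is the downsampled vectors being shipped back to the master before concatenation.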
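
And a minimal sketch of the two-pass SharedArray version, with the same caveats: HEADER_SIZE, NX_OFFSET and FS_RATIO are placeholders, and the "downsample" step is again just decimation to keep the example short.

    @everywhere const HEADER_SIZE = 512   # assumed fixed header length, bytes
    @everywhere const NX_OFFSET   = 64    # assumed byte offset of NX in the header
    @everywhere const FS_RATIO    = 10    # assumed downsampling ratio

    # pass 1: header-only read to get NX for one file
    @everywhere function get_nx(fname)
        io = open(fname, "r")
        seek(io, NX_OFFSET)
        nx = Int(read(io, Int32))
        close(io)
        return nx
    end

    # pass 2: read one file's data and decimate it into xx starting at offset j
    @everywhere function read_chunk!(xx, fname, j, nx)
        io = open(fname, "r")
        seek(io, HEADER_SIZE)
        d = read(io, Float32, nx)          # real code branches on sample precision
        close(io)
        for k = 1:div(nx, FS_RATIO)
            xx[j + k] = d[(k - 1) * FS_RATIO + 1]
        end
        return nothing
    end

    function load_all(files)
        nxs     = pmap(get_nx, files)                  # pass 1, in parallel
        chunks  = Int[div(nx, FS_RATIO) for nx in nxs]
        xx      = SharedArray(Float64, sum(chunks))
        offsets = cumsum([0; chunks[1:end-1]])
        @sync @parallel for i = 1:length(files)        # pass 2, in parallel
            read_chunk!(xx, files[i], offsets[i], nxs[i])
        end
        return xx
    end

    # xx = load_all(files)

The idea is that pass 1 only touches headers, and pass 2 writes directly into the pre-allocated SharedArray, so nothing large ever travels back to the master; the cost is opening every file twice.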
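
On the 4-bit aside: the baseline I have in mind is straight nibble unpacking, assuming two samples per byte with the high nibble first (swap hi/lo if your format packs them the other way):

    # Unpack signed two's-complement 4-bit samples, two per byte.
    function decode_int4(raw::Vector{UInt8})
        out = zeros(Int8, 2 * length(raw))
        @inbounds for i = 1:length(raw)
            b = raw[i]
            out[2i - 1] = reinterpret(Int8, b) >> 4       # arithmetic shift sign-extends the high nibble
            out[2i]     = reinterpret(Int8, b << 4) >> 4  # shift low nibble up, then sign-extend back down
        end
        return out
    end

    # decode_int4([0xf5]) == Int8[-1, 5]

Is there anything meaningfully faster than this kind of per-byte bit twiddling?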