julia-users › Help with parallel computing for optimization

Nsep    Apr 11

Hello World,

I am relatively new to Julia. I wrote an optimization model (MIP) that I need to run many, many times for sensitivity analysis, and I am working on a cluster that uses SLURM. I wrote my model as a Julia module. Basically, I want to have a file listing all the different cases and, using a for-loop, have each case solved by a different node (all cores on the node solving the same MIP case). I have tried two different ways.

(1) Using the machinefile option of Julia (see the .sh file below):

#!/bin/bash
#SBATCH --uid=nsep
#SBATCH --job-name="juliaTest"
#SBATCH --partition=newnodes
#SBATCH --output="juliaTest.%j.%N.out"
#SBATCH --error="juliaTest.%j.%N.err"
#SBATCH --time=1:0:0
#SBATCH -N 3
##SBATCH -n 20
#SBATCH --export=ALL

export SLURM_NODEFILE=`generate_pbs_nodefile`
. /etc/profile.d/modules.sh
module add engaging/julia/0.4.3
module add engaging/gurobi/6.5.1
julia --machinefile $SLURM_NODEFILE ~/Cases.jl

Using this method I get an error when loading MyModule (the model). The contents of Cases.jl are:

@everywhere push!(LOAD_PATH, "/home/nsep/Test")
@everywhere using MyModule
@everywhere using DataFrames
@everywhere inpath="/home/nsep/Test/Input"
@everywhere outpath="/home/nsep/Test/Results"
mysetup=Dict() # config. options for MyModule
@everywhere mysetup
casepath="/home/nsep/Test"
@everywhere cases_in_data = readtable("$casepath/Cases_Control.csv", header=true)

@parallel for c in 1:size(cases_in_data,1)
    # loading general inputs
    myinputs = Load_inputs(mysetup,inpath)
    # creating output directory
    mkdir("$outpath/Case$c")
    case_outpath="$outpath/Case$c"
    # case-specific inputs
    myinputs["pMaxCO2"][1]=cases_in_data[:Emissions][c]
    myresults = solve_model(mysetup,myinputs)
    write_outputs(mysetup,case_outpath,myresults,myinputs)
end

The error that I get is:

WARNING: replacing module MyModule
WARNING: replacing module MyModule
WARNING: replacing module MyModule

signal (11): Segmentation fault
jl_module_using at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x2aaaaae0def9)
unknown function (ip: 0x2aaaaae0e1e5)
unknown function (ip: 0x2aaaaae0de3d)
unknown function (ip: 0x2aaaaae0e77c)
jl_load_file_string at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
include_string at loading.jl:266
jl_apply_generic at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:307
jl_apply_generic at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x2aaaaadf92a3)
unknown function (ip: 0x2aaaaadf8639)
unknown function (ip: 0x2aaaaae0daac)
jl_toplevel_eval_in at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
eval at ./sysimg.jl:14
jl_apply_generic at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:1364
jl_f_apply at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:910
run_work_thunk at multi.jl:651
run_work_thunk at multi.jl:660
jlcall_run_work_thunk_21367 at (unknown line)
jl_apply_generic at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
anonymous at task.jl:58
unknown function (ip: 0x2aaaaadff514)
unknown function (ip: (nil))
sh: line 1: 18358 Segmentation fault      /cm/shared/engaging/julia/julia-a2f713dea5/bin/julia --worker

Worker 2 terminated.
ERROR (unhandled task failure): EOFError: read end of file
 in read at stream.jl:911
 in message_handler_loop at multi.jl:868
 in process_tcp_streams at multi.jl:857
 in anonymous at task.jl:63
ERROR: LoadError: ProcessExitedException()
 in yieldto at ./task.jl:71
 in wait at ./task.jl:371
 in wait at ./task.jl:286
 in wait at ./channels.jl:63
 in take! at ./channels.jl:53
 in take! at ./multi.jl:809
 in remotecall_fetch at multi.jl:735
 in remotecall_fetch at multi.jl:740
 in anonymous at multi.jl:1386
...and 1 other exceptions.
 in sync_end at ./task.jl:413
 in anonymous at multi.jl:1395
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
while loading /home/nsep/Cases.jl, in expression starting on line 3
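For comparison, below is a minimal pmap-based sketch of the same driver. It assumes MyModule (with Load_inputs, solve_model and write_outputs) and the directory layout are exactly as in the post above; pmap is usually a better fit than @parallel for a handful of long, independent solves, but this sketch does not by itself address the worker segfault.

# Sketch of a pmap-based Cases.jl (Julia 0.4 syntax, untested on this cluster).
# MyModule, its functions, and all paths are taken from the post above.
@everywhere push!(LOAD_PATH, "/home/nsep/Test")
@everywhere using MyModule
@everywhere using DataFrames

@everywhere inpath  = "/home/nsep/Test/Input"
@everywhere outpath = "/home/nsep/Test/Results"

# One self-contained function per case; pmap sends each call to an idle worker.
@everywhere function run_case(c, emissions)
    mysetup  = Dict()                       # config. options for MyModule
    myinputs = Load_inputs(mysetup, inpath) # general inputs
    case_outpath = "$outpath/Case$c"
    isdir(case_outpath) || mkdir(case_outpath)
    myinputs["pMaxCO2"][1] = emissions      # case-specific input
    myresults = solve_model(mysetup, myinputs)
    write_outputs(mysetup, case_outpath, myresults, myinputs)
    return c
end

cases_in_data = readtable("/home/nsep/Test/Cases_Control.csv", header=true)
emissions = cases_in_data[:Emissions]
pmap(c -> run_case(c, emissions[c]), 1:size(cases_in_data, 1))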
(2) The other method I tried was using ClusterManagers.jl; the .sh file is below:

#!/bin/bash
#SBATCH --uid=nsep
#SBATCH --job-name="juliaTest"
#SBATCH --partition=newnodes
#SBATCH --output="juliaTest.%j.%N.out"
#SBATCH --error="juliaTest.%j.%N.err"
#SBATCH --time=0:2:0
#SBATCH -N 4
#SBATCH --export=ALL

. /etc/profile.d/modules.sh
module add engaging/julia/0.4.3
module add engaging/gurobi/6.5.1
julia ~/julia_cluster.jl

In the Julia code I then tried to run the SLURM example from the ClusterManagers page:

using ClusterManagers

# Arguments to the Slurm srun(1) command can be given as keyword
# arguments to addprocs. The argument name and value is translated to
# a srun(1) command line argument as follows:
# 1) If the length of the argument is 1 => "-arg value",
#    e.g. t="0:1:0" => "-t 0:1:0"
# 2) If the length of the argument is > 1 => "--arg=value"
#    e.g. time="0:1:0" => "--time=0:1:0"
# 3) If the value is the empty string, it becomes a flag value,
#    e.g. exclusive="" => "--exclusive"
# 4) If the argument contains "_", they are replaced with "-",
#    e.g. mem_per_cpu=100 => "--mem-per-cpu=100"
addprocs(SlurmManager(4), partition="newnodes", t="00:2:00")

hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    push!(hosts, host)
    push!(pids, pid)
end

# The Slurm resource allocation is released when all the workers have
# exited
for i in workers()
    rmprocs(i)
end

But I get this error:

Error launching Slurm job: MethodError(length,(:all_to_all,))

If anyone could help me figure out (1) what is wrong in my code when loading MyModule, and (2) what I am doing wrong when trying ClusterManagers, that would be AWESOME!

Jiahao Chen    Apr 12

This looks like the bug reported in ClusterManagers.jl#31. Can you try Pkg.checkout("ClusterManagers") and see if that works for you?
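For anyone hitting the same MethodError, the suggested workaround amounts to running the example above against the development version of ClusterManagers. A minimal sketch with the Julia 0.4 package manager follows; the addprocs arguments are the ones from the post above.

# Check out the master branch of ClusterManagers, per the suggestion above,
# then retry the SlurmManager example in a fresh session.
Pkg.checkout("ClusterManagers")   # track master instead of the released tag
# Pkg.free("ClusterManagers")     # run this later to return to the release

using ClusterManagers
addprocs(SlurmManager(4), partition="newnodes", t="00:2:00")
println(workers())                # ids of the Slurm-launched workers, if any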