julia-users › Help with parallel computing for optimization

Nsep    Apr 11

Hello World,

I am relatively new to Julia. I wrote an optimization model (MIP) that I need to run many, many times for sensitivity analysis, and I am working on a cluster that uses SLURM. I wrote my model as a Julia module. Basically, I want to have a file listing all the different cases and, using a for-loop, have each case solved by a different node (all cores on the node solving the same MIP case). I have tried two different ways.

(1) Using the machinefile option of Julia (see the .sh file below):

#!/bin/bash
#SBATCH --uid=nsep
#SBATCH --job-name="juliaTest"
#SBATCH --partition=newnodes
#SBATCH --output="juliaTest.%j.%N.out"
#SBATCH --error="juliaTest.%j.%N.err"
#SBATCH --time=1:0:0
#SBATCH -N 3
##SBATCH -n 20
#SBATCH --export=ALL

export SLURM_NODEFILE=`generate_pbs_nodefile`
. /etc/profile.d/modules.sh
module add engaging/julia/0.4.3
module add engaging/gurobi/6.5.1
julia --machinefile $SLURM_NODEFILE ~/Cases.jl

Using this method I get an error when loading MyModule (the model). The contents of Cases.jl are:

@everywhere push!(LOAD_PATH, "/home/nsep/Test")
@everywhere using MyModule
@everywhere using DataFrames
@everywhere inpath="/home/nsep/Test/Input"
@everywhere outpath="/home/nsep/Test/Results"
mysetup=Dict() # config. options for MyModule
@everywhere mysetup
casepath="/home/nsep/Test"
@everywhere cases_in_data = readtable("$casepath/Cases_Control.csv", header=true)

@parallel for c in 1:size(cases_in_data,1)
    # loading general inputs
    myinputs = Load_inputs(mysetup,inpath)
    # creating output directory
    mkdir("$outpath/Case$c")
    case_outpath="$outpath/Case$c"
    # case-specific inputs
    myinputs["pMaxCO2"][1]=cases_in_data[:Emissions][c]
    myresults = solve_model(mysetup,myinputs)
    write_outputs(mysetup,case_outpath,myresults,myinputs)
end

The error that I get is:

WARNING: replacing module MyModule
WARNING: replacing module MyModule
WARNING: replacing module MyModule

signal (11): Segmentation fault
jl_module_using at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x2aaaaae0def9)
unknown function (ip: 0x2aaaaae0e1e5)
unknown function (ip: 0x2aaaaae0de3d)
unknown function (ip: 0x2aaaaae0e77c)
jl_load_file_string at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
include_string at loading.jl:266
jl_apply_generic at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:307
jl_apply_generic at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x2aaaaadf92a3)
unknown function (ip: 0x2aaaaadf8639)
unknown function (ip: 0x2aaaaae0daac)
jl_toplevel_eval_in at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
eval at ./sysimg.jl:14
jl_apply_generic at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:1364
jl_f_apply at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
anonymous at multi.jl:910
run_work_thunk at multi.jl:651
run_work_thunk at multi.jl:660
jlcall_run_work_thunk_21367 at (unknown line)
jl_apply_generic at /cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so (unknown line)
anonymous at task.jl:58
unknown function (ip: 0x2aaaaadff514)
unknown function (ip: (nil))
sh: line 1: 18358 Segmentation fault      /cm/shared/engaging/julia/julia-a2f713dea5/bin/julia --worker

Worker 2 terminated.
ERROR (unhandled task failure): EOFError: read end of file
 in read at stream.jl:911
 in message_handler_loop at multi.jl:868
 in process_tcp_streams at multi.jl:857
 in anonymous at task.jl:63
ERROR: LoadError: ProcessExitedException()
 in yieldto at ./task.jl:71
 in wait at ./task.jl:371
 in wait at ./task.jl:286
 in wait at ./channels.jl:63
 in take! at ./channels.jl:53
 in take! at ./multi.jl:809
 in remotecall_fetch at multi.jl:735
 in remotecall_fetch at multi.jl:740
 in anonymous at multi.jl:1386
...and 1 other exceptions.
 in sync_end at ./task.jl:413
 in anonymous at multi.jl:1395
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
while loading /home/nsep/Cases.jl, in expression starting on line 3
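For comparison, below is a minimal pmap-based sketch of the same driver. It assumes MyModule (with Load_inputs, solve_model and write_outputs) and the directory layout are exactly as in the post above; pmap is usually a better fit than @parallel for a handful of long, independent solves, but this sketch does not by itself address the worker segfault.

# Sketch of a pmap-based Cases.jl (Julia 0.4 syntax, untested on this cluster).
# MyModule, its functions, and all paths are taken from the post above.
@everywhere push!(LOAD_PATH, "/home/nsep/Test")
@everywhere using MyModule
@everywhere using DataFrames

@everywhere inpath  = "/home/nsep/Test/Input"
@everywhere outpath = "/home/nsep/Test/Results"

# One self-contained function per case; pmap sends each call to an idle worker.
@everywhere function run_case(c, emissions)
    mysetup  = Dict()                       # config. options for MyModule
    myinputs = Load_inputs(mysetup, inpath) # general inputs
    case_outpath = "$outpath/Case$c"
    isdir(case_outpath) || mkdir(case_outpath)
    myinputs["pMaxCO2"][1] = emissions      # case-specific input
    myresults = solve_model(mysetup, myinputs)
    write_outputs(mysetup, case_outpath, myresults, myinputs)
    return c
end

cases_in_data = readtable("/home/nsep/Test/Cases_Control.csv", header=true)
emissions = cases_in_data[:Emissions]
pmap(c -> run_case(c, emissions[c]), 1:size(cases_in_data, 1))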
(2) The other method I tried was using ClusterManagers.jl; the .sh file is below:

#!/bin/bash
#SBATCH --uid=nsep
#SBATCH --job-name="juliaTest"
#SBATCH --partition=newnodes
#SBATCH --output="juliaTest.%j.%N.out"
#SBATCH --error="juliaTest.%j.%N.err"
#SBATCH --time=0:2:0
#SBATCH -N 4
#SBATCH --export=ALL

. /etc/profile.d/modules.sh
module add engaging/julia/0.4.3
module add engaging/gurobi/6.5.1
julia ~/julia_cluster.jl

In the Julia code I then tried to run the SLURM example from the ClusterManagers page:

using ClusterManagers

# Arguments to the Slurm srun(1) command can be given as keyword
# arguments to addprocs. The argument name and value is translated to
# a srun(1) command line argument as follows:
# 1) If the length of the argument is 1 => "-arg value",
#    e.g. t="0:1:0" => "-t 0:1:0"
# 2) If the length of the argument is > 1 => "--arg=value"
#    e.g. time="0:1:0" => "--time=0:1:0"
# 3) If the value is the empty string, it becomes a flag value,
#    e.g. exclusive="" => "--exclusive"
# 4) If the argument contains "_", they are replaced with "-",
#    e.g. mem_per_cpu=100 => "--mem-per-cpu=100"
addprocs(SlurmManager(4), partition="newnodes", t="00:2:00")

hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    push!(hosts, host)
    push!(pids, pid)
end

# The Slurm resource allocation is released when all the workers have
# exited
for i in workers()
    rmprocs(i)
end

But I get this error:

Error launching Slurm job: MethodError(length,(:all_to_all,))

If anyone could help me figure out (1) what is wrong in my code when loading MyModule, and (2) what I am doing wrong when trying ClusterManagers, that would be AWESOME!

Jiahao Chen    Apr 12

This looks like the bug reported in ClusterManagers.jl#31. Can you try Pkg.checkout("ClusterManagers") and see if that works for you?
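For anyone hitting the same MethodError, the suggested workaround amounts to running the example above against the development version of ClusterManagers. A minimal sketch with the Julia 0.4 package manager follows; the addprocs arguments are the ones from the post above.

# Check out the master branch of ClusterManagers, per the suggestion above,
# then retry the SlurmManager example in a fresh session.
Pkg.checkout("ClusterManagers")   # track master instead of the released tag
# Pkg.free("ClusterManagers")     # run this later to return to the release

using ClusterManagers
addprocs(SlurmManager(4), partition="newnodes", t="00:2:00")
println(workers())                # ids of the Slurm-launched workers, if any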