Workers Not Visible to Each Other on Cluster #9951

Closed. JaredCrean2 opened this Issue on Jan 28, 2015 · 4 comments
Labels: bug, parallel
Participants: @JaredCrean2, @amitmurthy, @ViralBShah

JaredCrean2 commented on Jan 28, 2015

When running Julia on a cluster I encountered a problem with some workers not being visible to others. I created a function to read a list of hostnames from a file and pass them to addprocs(). This completes successfully, but inspecting the output from the workers() shows that some workers do not know about others. Here is the function code:

    function createProcs2(fname)
        # get hostnames from file and add them

        # get hostnames from file
        f = open(fname,"r")
        hostnames = readlines(f)  # read all lines from file (vector)
        m = length(hostnames)

        # add workers
        addprocs(hostnames)

        # verify
        worker_list = workers()
        num_workers = length(worker_list)

        # print to log
        #printToMasterLog("worker list = $worker_list")
        #printToMasterLog("num_workers = $num_workers")

        return m
    end  # end function createProcs()

Then on each worker I run

    known_workers = workers()
    sort!(known_workers)
    num_workers = length(known_workers)

The first 16 workers (corresponding to the first node) all show the expected output:

    known workers = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33]
    num_workers = 32

But some (not all) of the workers on other nodes show output like this:

    known workers = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,24,25,26,27,28,29,30,31,32,33]
    num_workers = 29

Which workers cannot see which other workers varies from run to run, but the workers on the first node consistently are able to see all the other workers.

I am running julia version 0.4.0-dev+2698 on an Intel x86_64 cluster using Red Hat Enterprise Linux and SLURM as a job management tool.
I tried changing max_parallel = 1, 10, and the number of workers to no avail. Any ideas why this is happening?

amitmurthy (The Julia Language member) commented on Jan 28, 2015

Could you also provide the commit of your build?

    julia> versioninfo()
    Julia Version 0.4.0-dev+2914
    Commit 4c3e03b* (2015-01-26 06:17 UTC)
    ....

The parallel stuff has seen some changes in the recent past and it would be good to know how old a build you are running.

Also, it will be helpful if you could also print out myid() on each of the workers. Workers with higher pids connect to workers with lower pids (except for pid 1, which initiates connections to all workers). The entire mesh setup may take some time to complete. Do you see the same issue if you run the print statements on workers after some time?

JaredCrean2 commented on Jan 28, 2015

Thanks for the quick reply. I added sleep(300) at the end of createProcs2(), and then collected the following output:

    Julia Version 0.4.0-dev+2698
    Commit 0f9b0c6* (2015-01-14 14:36 UTC)
    Platform Info:
      System: Linux (x86_64-unknown-linux-gnu)
      CPU: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
      WORD_SIZE: 64
      BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
      LAPACK: libopenblas
      LIBM: libopenlibm
      LLVM: libLLVM-3.3
    nothing
    From worker 3: myid() = 3, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 32
    From worker 33: myid() = 33, worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 17
    From worker 21: myid() = 21, worker_list = [2,3,19,20,21,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 16
    From worker 24: myid() = 24, worker_list = [2,3,19,20,21,22,23,24,28,29,30,31,32,33] , num_workers = 14
    From worker 25: myid() = 25, worker_list = [2,3,19,20,21,22,25,27,28,29,30,31,32,33] , num_workers = 14
    From worker 2: myid() = 2, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 32
    From worker 14: myid() = 14, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 16: myid() = 16, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 8: myid() = 8, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 18: myid() = 18, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 17: myid() = 17, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 11: myid() = 11, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 5: myid() = 5, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 4: myid() = 4, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 13: myid() = 13, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 7: myid() = 7, worker_list = [2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 16
    From worker 9: myid() = 9, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 6: myid() = 6, worker_list = [2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 16
    From worker 10: myid() = 10, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 28: myid() = 28, worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 17
    From worker 26: myid() = 26, worker_list = [2,3,19,20,21,22,26,27,28,29,30,31,32,33] , num_workers = 14
    From worker 27: myid() = 27, worker_list = [2,3,19,20,21,22,25,26,27,28,29,30,31,32,33] , num_workers = 15
    From worker 19: myid() = 19, worker_list = [2,3,19,20,21,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 16
    From worker 22: myid() = 22, worker_list = [2,3,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 14
    From worker 20: myid() = 20, worker_list = [2,3,19,20,21,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 16
    From worker 29: myid() = 29, worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,29,32,33] , num_workers = 15
    From worker 23: myid() = 23, worker_list = [2,3,19,20,21,22,23,24,28,29,30,31,32,33] , num_workers = 14
    From worker 31: myid() = 31, worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,30,31,32,33] , num_workers = 16
    From worker 15: myid() = 15, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 12: myid() = 12, worker_list = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] , num_workers = 17
    From worker 32: myid() = 32, worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33] , num_workers = 17
    From worker 30: myid() = 30, worker_list = [2,3,19,20,21,22,23,24,25,26,27,28,30,31,32,33] , num_workers = 16

It appears the workers on the first node are indeed affected.

ViralBShah added the parallel label on Jan 28, 2015

amitmurthy (The Julia Language member) commented on Jan 28, 2015

I confirm that I see the issue on my laptop too. Will look into it.

amitmurthy added the bug label on Jan 28, 2015

JaredCrean2 commented on Jan 28, 2015

Thanks for looking into it, let me know what you find.

amitmurthy added a commit to amitmurthy/julia that referenced this issue on Jan 29, 2015: fix bug in worker-to-worker connection setup. closes #9951 (473824b)

amitmurthy added a commit to amitmurthy/julia that referenced this issue on Jan 29, 2015: fix bug in worker-to-worker connection setup. closes #9951 (afc7aed)

amitmurthy added a commit to amitmurthy/julia that referenced this issue on Jan 29, 2015: fix bug in worker-to-worker connection setup. closes #9951 (60d43d5)

amitmurthy referenced this issue on Jan 29, 2015: Merged, fix bug in worker-to-worker connection setup. closes #9951 (#9953)

amitmurthy closed this in #9953 on Jan 29, 2015

amitmurthy referenced this issue on Jan 31, 2015: Merged, more fixes for worker-worker connection setups (#9979)
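
For anyone reading this thread later: the connection rule amitmurthy describes (pid 1 initiates connections to all workers, and each worker connects to the workers with lower pids) and the gaps visible in the posted log can be made explicit with a small sketch. This is illustrative only. The helper names `dials` and `missing_peers` are hypothetical and not part of Julia, the code assumes a current Julia 1.x, and it models the rule as stated in the thread rather than the actual connection code that was fixed in #9953.

```julia
# Hypothetical helper: the peers a process would dial under the rule quoted
# in the thread. pid 1 dials every worker; a worker dials only the workers
# whose pid is lower than its own.
function dials(pid::Int, worker_ids::Vector{Int})
    pid == 1 && return copy(worker_ids)
    return [w for w in worker_ids if w < pid]
end

# Hypothetical helper: workers that are absent from one worker's reported list.
missing_peers(expected::Vector{Int}, seen::Vector{Int}) = setdiff(expected, seen)

# The full set of worker ids in the report (2 through 33).
expected = collect(2:33)

# Worker lists copied from the posted log for workers 14 and 22.
seen_by_14 = collect(2:18)
seen_by_22 = vcat([2, 3], collect(22:33))

println("worker 4 dials:        ", dials(4, expected))
println("missing for worker 14: ", missing_peers(expected, seen_by_14))
println("missing for worker 22: ", missing_peers(expected, seen_by_22))
```

Applied to the posted log, this shows worker 14 missing workers 19 through 33 and worker 22 missing workers 4 through 21. On a live cluster the `seen` list would come from calling workers() on each worker, as in the original report.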