I’m fairly new to Dask and am trying to get a tiled geophysical inversion code running across several nodes of our HPC cluster, which uses a SLURM scheduler.
I have experimented with different configurations and can get the code to run, but the memory load and computation time for each worker are significantly larger than anticipated. To troubleshoot, I created a toy problem to get a better idea of how SLURM and Dask-MPI behave and interact.
Here is a copy of my simple python script:
```python
import os
import platform
import time

import numpy as np

import dask_mpi
import dask.distributed as dd


def run(tileID):
    tileName = 'tile' + str(tileID)
    dd.print('Node:', platform.node(), 'Begin sleeping: ' + tileName + '\n')
    time.sleep(20)
    dd.print('Finished sleeping: ' + tileName)

    # seed random number generator
    np.random.seed(1)
    # generate random numbers between 0-1
    a = np.random.random((10000000, 3))
    b = np.random.random((10000000, 3))
    c = np.multiply(a, b)
    dd.print(tileName + ' shape:', c.shape)


if __name__ == '__main__':
    # Initialize dask-mpi to set up the scheduler and remote workers
    dask_mpi.core.initialize(local_directory=os.getcwd(), nthreads=1,
                             memory_limit="100 GB", interface="ib0")
    print("Initializing Dask-MPI...")

    # Set up the client
    client = dd.Client()
    print('Client information')
    print(client)
    print('Cluster information')
    print(client.cluster)

    nTiles = 8
    time0 = time.time()

    # Run the example problem: one task per tile
    futures = []
    for tile in range(nTiles):
        print('Tile:', tile)
        futures.append(client.submit(run, tile))
    client.gather(futures)

    time1 = time.time()
    print('Total job time: ' + str(time1 - time0))

    client.shutdown()
```
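One thing I noticed while debugging: `print(client)` reports `processes=0 threads=0`, which I suspect means no workers had registered yet when the `Client` was created. A stand-in sketch of the check I've been adding — using `LocalCluster` here only because I can run it off the HPC; my assumption is that after `dask_mpi.core.initialize` the same calls apply to the `dd.Client()` directly:

```python
from dask.distributed import Client, LocalCluster

# LocalCluster stands in for the dask-mpi cluster (assumption: the same
# Client calls apply there). wait_for_workers blocks until the expected
# number of workers have registered with the scheduler.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)
client.wait_for_workers(n_workers=2)

n_registered = len(client.scheduler_info()["workers"])
print(n_registered)  # 2

client.close()
cluster.close()
```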
I then submit the python script by calling

```bash
sbatch Test_DaskMPI_3Node_tpn4_t0.batch
```

Here is a copy of the batch file:
```bash
#!/bin/bash
#SBATCH --job-name=Test_DaskMPI_3Node_tpn4_t0
#SBATCH --hint=nomultithread
#SBATCH --mem=100GB
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=3
#SBATCH --exclusive
#SBATCH --reservation=dev
#SBATCH --account=vhp
#SBATCH --time=00:02:00
#SBATCH --output=Test_DaskMPI_3Node_tpn4_t0.log

# module swap PrgEnv-cray PrgEnv-gnu
export OMPI_MCA_btl=^openib

source /home/mamitchell/bin/miniconda3/bin/activate simpeg_main

mpirun -np 12 python -u ./Test_DaskMPI_3Node_tpn4_t0.py | tee ./Test_DaskMPI_3Node_tpn4_t0_output.txt
```
The hope is that the `run` function will be evaluated once for each tile (0–7), and that these computations will be split over the 3 nodes, with each node handling up to 4 tasks.
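For what it's worth, my understanding from the dask-mpi docs (so treat the exact split as my assumption) is that the 12 MPI ranks are divided as: rank 0 becomes the scheduler, rank 1 runs this client script, and the remaining ranks become workers. That would leave 10 workers rather than 12, which seems consistent with the worker names 2–11 in the log below:

```python
# My assumption of how dask-mpi assigns roles to the ranks from `mpirun -np 12`:
# rank 0 -> scheduler, rank 1 -> the client script, ranks 2+ -> workers.
def rank_roles(n_ranks):
    roles = {0: "scheduler", 1: "client"}
    roles.update({rank: "worker" for rank in range(2, n_ranks)})
    return roles

workers = sorted(r for r, role in rank_roles(12).items() if role == "worker")
print(len(workers))  # 10 workers, not 12
print(workers)       # named 2 through 11
```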
Here is the resulting output from the log file:
2024-02-06 19:08:18,207 - distributed.scheduler - INFO - State start
2024-02-06 19:08:18,212 - distributed.scheduler - INFO - Scheduler at: tcp://172.25.1.51:40095
2024-02-06 19:08:18,212 - distributed.scheduler - INFO - dashboard at: http://172.25.1.51:8787/status
2024-02-06 19:08:18,213 - distributed.scheduler - INFO - Registering Worker plugin shuffle
Initializing Dask-MPI...
2024-02-06 19:08:18,295 - distributed.worker - INFO - Start worker at: tcp://172.25.1.126:44111
2024-02-06 19:08:18,295 - distributed.worker - INFO - Listening to: tcp://172.25.1.126:44111
2024-02-06 19:08:18,295 - distributed.worker - INFO - Worker name: 6
2024-02-06 19:08:18,295 - distributed.worker - INFO - dashboard at: 172.25.1.126:43577
2024-02-06 19:08:18,296 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,296 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,296 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,296 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,296 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-gsg57edz
2024-02-06 19:08:18,296 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,296 - distributed.worker - INFO - Start worker at: tcp://172.25.1.51:37273
2024-02-06 19:08:18,296 - distributed.worker - INFO - Listening to: tcp://172.25.1.51:37273
2024-02-06 19:08:18,296 - distributed.worker - INFO - Worker name: 2
2024-02-06 19:08:18,296 - distributed.worker - INFO - dashboard at: 172.25.1.51:46475
2024-02-06 19:08:18,296 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,297 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,296 - distributed.worker - INFO - Start worker at: tcp://172.25.1.126:44001
2024-02-06 19:08:18,296 - distributed.worker - INFO - Listening to: tcp://172.25.1.126:44001
2024-02-06 19:08:18,296 - distributed.worker - INFO - Worker name: 7
2024-02-06 19:08:18,297 - distributed.worker - INFO - dashboard at: 172.25.1.126:41757
2024-02-06 19:08:18,297 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,297 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,297 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,297 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,297 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-9f_6whjt
2024-02-06 19:08:18,297 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,297 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,297 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,297 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-v8la7m69
2024-02-06 19:08:18,297 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,298 - distributed.worker - INFO - Start worker at: tcp://172.25.1.142:37673
2024-02-06 19:08:18,298 - distributed.worker - INFO - Listening to: tcp://172.25.1.142:37673
2024-02-06 19:08:18,298 - distributed.worker - INFO - Worker name: 8
2024-02-06 19:08:18,298 - distributed.worker - INFO - dashboard at: 172.25.1.142:41843
2024-02-06 19:08:18,298 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - Start worker at: tcp://172.25.1.51:35759
2024-02-06 19:08:18,298 - distributed.worker - INFO - Start worker at: tcp://172.25.1.142:33697
2024-02-06 19:08:18,299 - distributed.worker - INFO - Listening to: tcp://172.25.1.142:33697
2024-02-06 19:08:18,299 - distributed.worker - INFO - Listening to: tcp://172.25.1.51:35759
2024-02-06 19:08:18,299 - distributed.worker - INFO - Worker name: 3
2024-02-06 19:08:18,299 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,299 - distributed.worker - INFO - Worker name: 11
2024-02-06 19:08:18,299 - distributed.worker - INFO - dashboard at: 172.25.1.51:45673
2024-02-06 19:08:18,299 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,299 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-wkgii4og
2024-02-06 19:08:18,299 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,299 - distributed.worker - INFO - dashboard at: 172.25.1.142:42547
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,299 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - Start worker at: tcp://172.25.1.126:43337
2024-02-06 19:08:18,299 - distributed.worker - INFO - Listening to: tcp://172.25.1.126:43337
2024-02-06 19:08:18,299 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,299 - distributed.worker - INFO - Worker name: 5
2024-02-06 19:08:18,299 - distributed.worker - INFO - dashboard at: 172.25.1.126:45521
2024-02-06 19:08:18,299 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,299 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-vbf5s_j3
2024-02-06 19:08:18,299 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,299 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,299 - distributed.worker - INFO - Start worker at: tcp://172.25.1.142:38005
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-if64_68p
2024-02-06 19:08:18,299 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - Listening to: tcp://172.25.1.142:38005
2024-02-06 19:08:18,299 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,299 - distributed.worker - INFO - Worker name: 9
2024-02-06 19:08:18,299 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-e6y6fw8l
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - dashboard at: 172.25.1.142:38933
2024-02-06 19:08:18,299 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,299 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,299 - distributed.worker - INFO - Start worker at: tcp://172.25.1.142:44753
2024-02-06 19:08:18,299 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-v_ea1ko_
2024-02-06 19:08:18,299 - distributed.worker - INFO - Listening to: tcp://172.25.1.142:44753
2024-02-06 19:08:18,299 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,299 - distributed.worker - INFO - Worker name: 10
2024-02-06 19:08:18,299 - distributed.worker - INFO - dashboard at: 172.25.1.142:36855
2024-02-06 19:08:18,300 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,300 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,300 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,300 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,300 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-xl1hkheh
2024-02-06 19:08:18,300 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,300 - distributed.worker - INFO - Start worker at: tcp://172.25.1.126:44497
2024-02-06 19:08:18,300 - distributed.worker - INFO - Listening to: tcp://172.25.1.126:44497
2024-02-06 19:08:18,300 - distributed.worker - INFO - Worker name: 4
2024-02-06 19:08:18,300 - distributed.worker - INFO - dashboard at: 172.25.1.126:38063
2024-02-06 19:08:18,300 - distributed.worker - INFO - Waiting to connect to: tcp://172.25.1.51:40095
2024-02-06 19:08:18,300 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,301 - distributed.worker - INFO - Threads: 1
2024-02-06 19:08:18,301 - distributed.worker - INFO - Memory: 93.13 GiB
2024-02-06 19:08:18,301 - distributed.worker - INFO - Local Directory: /caldera/hovenweep/projects/usgs/hazards/vhp/mamitchell/CLVF/Magnetics/DaskTesting/Dask_MPI/dask-scratch-space/worker-35u6hsfz
2024-02-06 19:08:18,301 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:18,877 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.126:44497', name: 4, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,461 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.126:44497
2024-02-06 19:08:19,461 - distributed.core - INFO - Starting established connection to tcp://172.25.1.126:48092
2024-02-06 19:08:19,462 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.126:44111', name: 6, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,462 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,462 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.126:44111
2024-02-06 19:08:19,462 - distributed.core - INFO - Starting established connection to tcp://172.25.1.126:48066
2024-02-06 19:08:19,462 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,463 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,463 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,463 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.142:37673', name: 8, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,463 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.142:37673
2024-02-06 19:08:19,463 - distributed.core - INFO - Starting established connection to tcp://172.25.1.142:60056
2024-02-06 19:08:19,463 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,464 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.126:43337', name: 5, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,464 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,464 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,464 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.126:43337
2024-02-06 19:08:19,464 - distributed.core - INFO - Starting established connection to tcp://172.25.1.126:48080
2024-02-06 19:08:19,464 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,464 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,465 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.142:44753', name: 10, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,465 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,465 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,465 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,465 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.142:44753
2024-02-06 19:08:19,465 - distributed.core - INFO - Starting established connection to tcp://172.25.1.142:60094
2024-02-06 19:08:19,465 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,466 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,466 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,466 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.126:44001', name: 7, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,466 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,466 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.126:44001
2024-02-06 19:08:19,466 - distributed.core - INFO - Starting established connection to tcp://172.25.1.126:48068
2024-02-06 19:08:19,466 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,467 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.142:38005', name: 9, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,467 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.142:38005
2024-02-06 19:08:19,467 - distributed.core - INFO - Starting established connection to tcp://172.25.1.142:60080
2024-02-06 19:08:19,467 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,467 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,467 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,467 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,467 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.142:33697', name: 11, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,468 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.142:33697
2024-02-06 19:08:19,468 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,468 - distributed.core - INFO - Starting established connection to tcp://172.25.1.142:60070
2024-02-06 19:08:19,468 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,468 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,468 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,468 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.51:35759', name: 3, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,468 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,469 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.51:35759
2024-02-06 19:08:19,469 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:60400
2024-02-06 19:08:19,468 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,468 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,469 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,469 - distributed.scheduler - INFO - Register worker <WorkerState 'tcp://172.25.1.51:37273', name: 2, status: init, memory: 0, processing: 0>
2024-02-06 19:08:19,469 - distributed.scheduler - INFO - Starting worker compute stream, tcp://172.25.1.51:37273
2024-02-06 19:08:19,469 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,469 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:60392
2024-02-06 19:08:19,469 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,469 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,470 - distributed.scheduler - INFO - Receive client connection: Client-60d54dba-c555-11ee-88b9-000005c1fe80
2024-02-06 19:08:19,470 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,470 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:60368
2024-02-06 19:08:19,470 - distributed.worker - INFO - Starting Worker plugin shuffle
2024-02-06 19:08:19,470 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,470 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,470 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
2024-02-06 19:08:19,470 - distributed.worker - INFO - Registered to: tcp://172.25.1.51:40095
2024-02-06 19:08:19,471 - distributed.worker - INFO - -------------------------------------------------
2024-02-06 19:08:19,471 - distributed.core - INFO - Starting established connection to tcp://172.25.1.51:40095
Client information
<Client: 'tcp://172.25.1.51:40095' processes=0 threads=0, memory=0 B>
Cluster information
None
Tile: 0
Tile: 1
Tile: 2
Tile: 3
Tile: 4
Tile: 5
Tile: 6
Tile: 7
Node: cn114 Begin sleeping: tile0
Node: cn114 Begin sleeping: tile1
Node:Node: cn114 Begin sleeping: tile0
cn130 Begin sleeping: tile2
Node: cn130 Begin sleeping: tile3
Node: cn114 Begin sleeping: tile1
Node: cn130 Begin sleeping: tile2
Node: cn039 Begin sleeping: tile4
Node: cn114 Begin sleeping: tile5
Node: cn130 Begin sleeping: tile3
Node: cn039 Begin sleeping: tile4
Node: cn130 Begin sleeping: tile6
Node: cn130Node: cn114 Begin sleeping: tile5
Node: cn130 Begin sleeping: tile6
Begin sleeping: tile7
Node: cn130 Begin sleeping: tile7
Finished sleeping: tile0
Finished sleeping: tile0
Finished sleeping: tile1
Finished sleeping: tile1
Finished sleeping: tile2
Finished sleeping: tile3
Finished sleeping: tile2
Finished sleeping: tile4
Finished sleeping: tile3
Finished sleeping: tile4
Finished sleeping: tile5
Finished sleeping: tile5
Finished sleeping: tile6
Finished sleeping: tile7
Finished sleeping: tile6
Finished sleeping: tile7
tile0 shape: (10000000, 3)
tile1 shape: (10000000, 3)
tile0 shape: (10000000, 3)
tile3 shape: (10000000, 3)
tile2 shape: (10000000, 3)
tile1 shape: (10000000, 3)
tile3 shape: (10000000, 3)
tile2 shape: (10000000, 3)
tile5 shape: (10000000, 3)
tile5 shape: (10000000, 3)
tile4 shape: (10000000, 3)
tile4 shape: (10000000, 3)
tile7 shape: (10000000, 3)
tile7 shape: (10000000, 3)
tile6 shape: (10000000, 3)
tile6 shape: (10000000, 3)
Total job time:20.840075492858887
[... lots of shutting-down messages ...]
I don’t understand why the `run` function appears to have been evaluated twice for each of the 8 “tiles”.
The computations were divided across the 3 allocated nodes in the following manner:

- Node cn114 ran tiles: 0, 1, 0, 1, 5, 5
- Node cn130 ran tiles: 2, 3, 2, 3, 6, 7, 6, 7
- Node cn039 ran tiles: 4, 4
From a smaller example with only 2 nodes, I initially thought the entire script was simply being run independently on each node. However, the 3-node test above shows that this is not the case, since then I would have expected each node to run all of tiles 0–7.
In addition to running the python script using

```bash
mpirun -np 12 python ./Test_DaskMPI_3Node_tpn4_t0.py
```

I have also tried

```bash
mpirun -np 12 -npernode 4 python ./Test_DaskMPI_3Node_tpn4_t0.py
```

and

```bash
srun --mpi=pmix python ./Test_DaskMPI_3Node_tpn4_t0.py
```
All of these variations produce the same results.
The only idea I have left is trying to use the `worker_options` argument of `dask_mpi.initialize` to specify the number of workers. However, I haven’t been able to find much information about the available options.
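For reference, this is the shape of the call I had in mind; as I understand it, the `worker_options` dict is forwarded to the worker class, but the key shown below is only a placeholder to illustrate the mechanism — I don't actually know which options would control what I want:

```python
# Fragment only -- this needs to run under mpirun, and the worker_options
# key here is a placeholder, not a known-good setting.
dask_mpi.core.initialize(
    local_directory=os.getcwd(),
    nthreads=1,
    memory_limit="100 GB",
    interface="ib0",
    worker_options={"local_directory": os.getcwd()},  # placeholder example
)
```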
I expect that I am doing something silly in the configuration, but I’m kind of out of ideas. Any thoughts or suggestions would be greatly appreciated! Thanks.