Hi all,
I would like to enforce that each worker processes only a single task at a time. In my case, each worker is a separate node of an HPC cluster managed by SLURM. The reason is that I want to run a memory-intensive computation, and running two of those tasks in parallel on a single node might exceed the node's memory limit.
I'm currently using dask-mpi since it allows me to request multiple nodes at once, but I haven't yet managed to ensure that only one task is executed per node. My SLURM batch script contains these entries (among others):
#SBATCH --nodes=5
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=36
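(For reference, since Dask also reads its configuration from environment variables, an alternative I considered — but have not verified — is declaring the worker resource in the batch script itself, so that every MPI rank picks it up before initialize() runs. Dask maps double-underscore-separated environment variables onto its config tree:)

```shell
# In the SLURM batch script, before the srun/mpirun line:
# DASK_DISTRIBUTED__WORKER__RESOURCES__TEST maps to the config key
# distributed.worker.resources.TEST, giving each worker one TEST unit.
export DASK_DISTRIBUTED__WORKER__RESOURCES__TEST=1
```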
I initialize my workers with dask-mpi as follows:
from dask.distributed import Client
from dask_mpi import initialize
initialize(nthreads=36)
client = Client()
This setup works in principle, but it still runs two tasks per worker once there are, for example, 10 concurrent tasks. I tried to implement the recommendations from here as follows, but without any success:
import dask

with dask.config.set({"distributed.worker.resources.TEST": 1}):
    initialize(nthreads=36)

client = Client()
...
dask.compute(results, optimize_graph=False, resources={"TEST": 1})
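For reference, here is a minimal local sketch of the behaviour I'm after, using a LocalCluster instead of dask-mpi (the resource name TEST and the function task are just placeholders): each worker advertises one unit of TEST and each task requests one unit, so a worker should never run two of these tasks concurrently even though it has several threads.

```python
import time

from dask.distributed import Client, LocalCluster

# Each worker gets exactly one unit of the custom "TEST" resource.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=4,   # many threads per worker ...
    processes=False,
    resources={"TEST": 1},  # ... but only one TEST unit each
)
client = Client(cluster)

def task(i):
    time.sleep(0.1)  # stand-in for the memory-intensive computation
    return i

# Requiring one TEST unit per task caps each worker at one such task at a time.
futures = client.map(task, range(4), resources={"TEST": 1})
results = client.gather(futures)

client.close()
cluster.close()
```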
Does anyone know how to make this work with dask-mpi? Any help is greatly appreciated!