Hi all,
Summary
I’m seeing some unexpected behavior on a Slurm-managed HPC when I use a SLURMCluster instance to create dask workers with more than 4GB of memory each. Slurm allocates each worker job and grants the resources I specify, but dask doesn’t recognize the memory I requested.
Set-up
Within Python, I use dask_jobqueue to start a SLURMCluster instance, which I pass to my client like so:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
cluster = SLURMCluster(
    cores=8,
    processes=1,
    queue='partition1, partition2',
    memory='2GB',
    walltime='6:00:00',
    job_extra=['--job-name="test"'],
    extra=["--lifetime", "55m", "--lifetime-stagger", "4m"],
)
cluster.scale(4)
client = Client(cluster, timeout="1200s")
client.wait_for_workers(n_workers=4)
Given this configuration, I would expect each worker to have just under 2GB of memory, and indeed that’s the case.
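(For reference, I’m reading the per-worker limit off the scheduler; scheduler_info() reports it in bytes, so a quick sketch like the following prints it in GiB:)
# Sketch: print the memory limit the scheduler reports for each connected worker
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info["nthreads"], "threads,", round(info["memory_limit"] / 2**30, 2), "GiB")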
Issue
However, if I increase memory= beyond 4GB, I no longer get workers with the memory I’d expect. For instance, this set-up creates workers that each have only 4GB of memory, not 8:
cluster = SLURMCluster(
    cores=8,
    processes=1,
    queue='partition1, partition2',
    memory='8GB',
    walltime='6:00:00',
    job_extra=['--job-name="test"'],
    extra=["--lifetime", "55m", "--lifetime-stagger", "4m"],
)
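(A quick side note on units: I believe the 7.45GiB figure that appears in the job script and worker output below is just dask converting the decimal "8GB" string into binary GiB, so the requested amount itself looks right. A sketch of the conversion using dask's own helpers:)
from dask.utils import parse_bytes, format_bytes

nbytes = parse_bytes("8GB")   # 8_000_000_000 bytes (decimal GB)
print(format_bytes(nbytes))   # "7.45 GiB" (binary GiB), matching the --memory-limit below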
What I’ve tried
From my understanding of dask’s docs, this behavior would suggest a hardware or other limit of 4GB that dask falls back to when the requested amount isn’t satisfiable. However, that doesn’t appear to be the case here: dask requests, and is granted, an 8GB job. (I also double-checked with an HPC admin.)
- I can check the dask job script, which looks as I’d expect:
>>> cluster.job_script()
'#!/usr/bin/env bash\n\n#SBATCH -J dask-worker\n#SBATCH -p partition1, partition2\n#SBATCH -n 1\n#SBATCH --cpus-per-task=4\n#SBATCH --mem=30G\n#SBATCH -t 6:00:00\n#SBATCH --job-name="test"\n\n/path/to/bin/python -m distributed.cli.dask_worker tcp://10.19.6.41:36691 --nthreads 4 --memory-limit 7.45GiB --name dummy-name --nanny --death-timeout 60 --lifetime 55m --lifetime-stagger 4m --protocol tcp://\n'
- I can check the processes running on the node and see that the memory limit passed as an argument is ~8GB:
bash /var/spool/slurmd/job56791996/slurm_script
└─python -m distributed.cli.dask_worker tcp://10.18.1.58:44853 --nthreads 4 --memory-limit 7.45GiB --name SLURMCluster-0 --nanny --death-timeout 60 --protocol tcp://
├─python -c from multiprocessing.resource_tracker import main;main(12)
├─python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=13, pipe_handle=19) --multiprocessing-fork
│ └─4*[{python}]
└─4*[{python}]
- I can check the cgroup limits on that node, which are correctly set to 8GiB:
sh02-08n62 $ cat /sys/fs/cgroup/memory/slurm/uid_297147/job_56791996/memory.limit_in_bytes
8589934592
- And if I run the worker command from this job script manually, outside of dask_jobqueue, I get a dask worker with the memory I’d expect:
$ /path/to/bin/python -m distributed.cli.dask_worker tcp://10.19.6.41:36691 --nthreads 4 --memory-limit 7.45GiB --name dummy-name --nanny --death-timeout 60 --lifetime 55m --lifetime-stagger 4m --protocol tcp://
2022-06-30 14:39:47,184 - distributed.worker - INFO - Start worker at: tcp://171.66.103.162:44442
2022-06-30 14:39:47,184 - distributed.worker - INFO - Listening to: tcp://171.66.103.162:44442
2022-06-30 14:39:47,184 - distributed.worker - INFO - dashboard at: 171.66.103.162:38304
2022-06-30 14:39:47,184 - distributed.worker - INFO - Waiting to connect to: tcp://10.19.6.41:36691
2022-06-30 14:39:47,184 - distributed.worker - INFO - -------------------------------------------------
2022-06-30 14:39:47,184 - distributed.worker - INFO - Threads: 4
2022-06-30 14:39:47,184 - distributed.worker - INFO - Memory: 7.45 GiB
2022-06-30 14:39:47,185 - distributed.worker - INFO - Local Directory: /path/to/dask-worker-space/worker-ajlixlob
2022-06-30 14:39:47,185 - distributed.worker - INFO - -------------------------------------------------
2022-06-30 14:39:48,954 - distributed.worker - INFO - Registered to: tcp://10.19.6.41:36691
2022-06-30 14:39:48,954 - distributed.worker - INFO - -------------------------------------------------
2022-06-30 14:39:48,954 - distributed.core - INFO - Starting established connection
In contrast, every worker started from a SLURMCluster instance with more than 4GB requested reports
...
2022-MM-DD HH:MM:SS,000 - distributed.worker - INFO - Memory: 4.00 GiB
...
and, as you’d expect, these 4GB workers fail on memory-intensive tasks.
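(In case it helps, I can also ask the live workers what limit they think they have from the client side. This is only a sketch: I’m assuming the limit lives on worker.memory_manager in this version of distributed, with a fallback to the older worker.memory_limit attribute.)
# Sketch: report each running worker's own memory limit (value in bytes)
def reported_limit(dask_worker):
    mgr = getattr(dask_worker, "memory_manager", None)
    return mgr.memory_limit if mgr is not None else dask_worker.memory_limit

print(client.run(reported_limit))   # e.g. {'tcp://...': 4294967296} for the 4GiB workers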
Question
- Anyone know what’s going on here? Why 4GB?
- Any pointers for how I should be setting things up differently?
- Since it seems like the correct resources are being requested for each worker, would a reasonable workaround be to override or disable dask’s memory-limit configuration?
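For context on that last point, the kind of override I have in mind is something like the sketch below (untested, and based on my current set-up). Since the generated worker command already contains --memory-limit, I’m assuming an explicit duplicate appended via extra= would take precedence, though I’d still rather understand the root cause than paper over it.
cluster = SLURMCluster(
    cores=8,
    processes=1,
    queue='partition1, partition2',
    memory='8GB',
    walltime='6:00:00',
    job_extra=['--job-name="test"'],
    # untested: force the worker's advertised memory limit explicitly,
    # alongside the lifetime flags I already pass
    extra=["--memory-limit", "8GB",
           "--lifetime", "55m", "--lifetime-stagger", "4m"],
)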