I’m using dask_jobqueue with a SLURM cluster. My goal is to ensure:
- Each Dask task runs in a brand-new SLURM job (the tasks are heavy neural-network training runs).
- At most 5 tasks/jobs run concurrently at any given moment.
When I submit 10 tasks, Dask respects the concurrency limit of 5, but it reuses existing SLURM jobs for the later tasks instead of starting new ones (the snippet after my setup code shows how I'm checking this). I want each task to be associated with its own fresh SLURM job. Here is my current setup:
import numpy as np
from dask_jobqueue import SLURMCluster
from dask.distributed import Client, as_completed


def train_config(n_runs):
    # Placeholder for a heavy training run (n_runs is unused in this minimal example)
    rng = np.random.default_rng()
    return rng.standard_normal()


cluster = SLURMCluster(
    cores=2,
    account="xyz",
    memory="8000M",
    walltime="00:30:00",
    job_extra_directives=["--mem-per-cpu=2000M"],
    job_directives_skip=["--mem"],
    local_directory="/work/abc/tmp",
)
cluster.scale(jobs=5)
client = Client(cluster)

# Submitting 10 tasks
futures = [client.submit(train_config, 1) for _ in range(10)]
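
For reference, this is roughly how I'm confirming the reuse. The report_job_id task below is just a diagnostic added for this question, not part of my real workload; it returns the SLURM_JOB_ID environment variable that SLURM sets inside every job allocation, and several tasks come back with the same job ID:

import os
from dask.distributed import as_completed

def report_job_id(i):
    # SLURM sets SLURM_JOB_ID inside each job allocation, so this tells
    # us which SLURM job a given task actually ran in
    return i, os.environ.get("SLURM_JOB_ID")

diag_futures = [client.submit(report_job_id, i) for i in range(10)]
for future in as_completed(diag_futures):
    i, job_id = future.result()
    print(f"task {i} ran in SLURM job {job_id}")
# The same job ID shows up for more than one task, i.e. jobs are being reused.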
How can I configure Dask or the SLURMCluster so that each task runs in its own fresh SLURM job?
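
For what it's worth, I'm not attached to cluster.scale(jobs=5); if adaptive scaling is the right mechanism for getting a fresh job per task, something along these lines would also suit me (assuming I have the adapt() keyword names right):

# Adaptive scaling I'd also be happy with, capped at 5 concurrent jobs
cluster.adapt(minimum_jobs=0, maximum_jobs=5)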