Ensuring Each Dask Task Starts on a New SLURM Job with a Limit of 5 Concurrent Jobs

dierkes-j · October 27, 2023, 10:02am

I’m using dask_jobqueue with a SLURM cluster. My goal is to ensure:

Each Dask task starts on a brand-new SLURM job (the tasks are heavy, e.g. neural network training runs).
A maximum of 5 tasks/jobs are running concurrently at any given moment.

When I submit 10 tasks, Dask respects the concurrency limit of 5 tasks but reuses old SLURM jobs for new tasks. I want each task to be associated with its own fresh SLURM job.

import numpy as np
from dask_jobqueue import SLURMCluster
from dask.distributed import Client, as_completed

def train_config(n_runs):
    # Placeholder for a heavy training run; n_runs is unused here.
    rng = np.random.default_rng()
    return rng.standard_normal()

# Each SLURM job hosts one Dask worker with 2 cores and 8000M of memory.
cluster = SLURMCluster(
    cores=2,
    account="xyz",
    memory="8000M",
    walltime="00:30:00",
    # Request memory per CPU instead of the default --mem directive.
    job_extra_directives=["--mem-per-cpu=2000M"],
    job_directives_skip=["--mem"],
    local_directory="/work/abc/tmp"
)
# Ask for 5 SLURM jobs, i.e. at most 5 workers at a time.
cluster.scale(jobs=5)
client = Client(cluster)

# Submitting 10 tasks
futures = [client.submit(train_config, 1) for _ in range(10)]

How can I configure Dask or the SLURMCluster to ensure each task runs on its own fresh SLURM job?

guillaumeeb · October 27, 2023, 5:21pm

Hi @dierkes-j, welcome to the Dask community!

There is currently no way to have dask-jobqueue start a new job for each task. This is clearly not a design goal of Dask, and hence not of dask-jobqueue either.

A closely related need has been discussed in dask/dask-jobqueue issue #597, "Restart cluster job on task completion".

There are also other tools, such as submitit (facebookincubator/submitit on GitHub, a Python 3.8+ toolbox for submitting jobs to Slurm), which might be better suited for this.

dierkes-j · October 27, 2023, 6:55pm

Hi @guillaumeeb,

Thanks for your reply! I suspected this would be the case, but it is good to know for certain. I have now implemented similar behavior with submitit.

Thanks again for your help!
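
For readers who land here with the same question, below is a minimal sketch of how the submitit approach might look. It is not the code from the post above, which was not shared. It assumes submitit's AutoExecutor, its map_array method (which submits a SLURM job array, so every task runs in its own array job) and the slurm_array_parallelism parameter (which caps how many array jobs run at once). The function name run_all, the folder name "submitit_logs" and the resource values are hypothetical, and train_config uses the standard library instead of NumPy purely to keep the sketch self-contained.

```python
import random


def train_config(n_runs):
    # Stand-in for a heavy training run; returns a dummy score.
    return random.gauss(0.0, 1.0)


def run_all(configs, folder="submitit_logs", max_parallel=5, cluster=None):
    # Hypothetical helper. submitit is imported here so that this module
    # can still be loaded on machines where submitit is not installed.
    import submitit

    # cluster=None lets submitit pick the backend (SLURM when available);
    # "local" or "debug" can be passed to try the code without a scheduler.
    executor = submitit.AutoExecutor(folder=folder, cluster=cluster)
    executor.update_parameters(
        timeout_min=30,                        # walltime per job, in minutes
        cpus_per_task=2,
        slurm_array_parallelism=max_parallel,  # at most this many jobs at once
    )
    # One array job per element of configs; each calls train_config(config).
    jobs = executor.map_array(train_config, configs)
    # job.result() blocks until the corresponding job has finished.
    return [job.result() for job in jobs]
```

Calling run_all([1] * 10) from a login node would then submit ten array jobs, of which at most five run concurrently. Account, memory and partition settings are cluster-specific and are left out here.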
