Ensuring Each Dask Task Starts on a New SLURM Job with a Limit of 5 Concurrent Jobs

I’m using dask_jobqueue with a SLURM cluster. My goal is to ensure:

  1. Each Dask task starts on a brand-new SLURM job (e.g. for heavy neural network training runs).
  2. A maximum of 5 tasks/jobs are running concurrently at any given moment.

When I submit 10 tasks, Dask respects the concurrency limit of 5 tasks but reuses old SLURM jobs for new tasks. I want each task to be associated with its own fresh SLURM job.

import numpy as np
from dask_jobqueue import SLURMCluster
from dask.distributed import Client, as_completed

def train_config(n_runs):
    rng = np.random.default_rng()
    return rng.standard_normal()

cluster = SLURMCluster(...)  # cluster arguments not shown
client = Client(cluster)

# Submitting 10 tasks
futures = [client.submit(train_config, 1) for _ in range(10)]

How can I configure Dask or the SLURMCluster to ensure each task runs on its own fresh SLURM job?

Hi @dierkes-j, welcome to the Dask community!

There is currently no way to have dask-jobqueue start a new job for each task. This is not a design goal of Dask, and hence not of dask-jobqueue: workers are meant to be long-lived and to run many tasks.
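As an analogy only (this uses Python's standard-library thread pool, not Dask or dask-jobqueue), the sketch below shows the same pattern the question describes: a pool limited to 5 workers runs 10 tasks, and the workers are reused rather than recreated per task. The function name `task` is made up for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def task(i):
    # Report which pool worker executed this task.
    return threading.current_thread().name


# A pool capped at 5 workers, loosely comparable to 5 long-lived SLURM jobs.
with ThreadPoolExecutor(max_workers=5) as pool:
    worker_names = list(pool.map(task, range(10)))

distinct_workers = set(worker_names)
print(len(worker_names), "tasks ran on", len(distinct_workers), "workers")
```

The number of distinct workers never exceeds the pool size, however many tasks are submitted. A Dask worker started inside a SLURM job behaves in a similar spirit: it stays alive and picks up the next task.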

A similar need has been discussed in dask/dask-jobqueue issue #597, "Restart cluster job on task completion".

There are also other tools, such as submitit (facebookincubator/submitit, a Python 3.8+ toolbox for submitting jobs to Slurm), which might be better suited for this.
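A minimal sketch of how this might look with submitit, assuming it is installed and run from a node that can submit SLURM jobs. The helper name `run_on_fresh_jobs`, the log folder, and the 60-minute time limit are made up for illustration; `train_config` is a stand-in using the standard library rather than the NumPy version from the question. Each call becomes one task of a SLURM job array, and `slurm_array_parallelism` caps how many run at once.

```python
import random


def train_config(n_runs):
    """Stand-in for an expensive training run (hypothetical placeholder)."""
    rng = random.Random()
    return [rng.gauss(0.0, 1.0) for _ in range(n_runs)]


def run_on_fresh_jobs(configs, log_folder="submitit_logs", max_parallel=5):
    """Submit one SLURM array task per config, at most `max_parallel` at once."""
    # Imported here so this module can be loaded on machines without submitit.
    import submitit

    executor = submitit.AutoExecutor(folder=log_folder)
    executor.update_parameters(
        timeout_min=60,                        # assumed wall-time limit per task
        slurm_array_parallelism=max_parallel,  # cap on simultaneously running array tasks
    )
    # map_array submits all calls as a single job array;
    # each element runs in its own SLURM array task.
    jobs = executor.map_array(train_config, configs)
    # result() blocks until the corresponding job has finished.
    return [job.result() for job in jobs]
```

Calling `run_on_fresh_jobs([1] * 10)` would submit 10 array tasks with at most 5 running concurrently. Partition, memory, and GPU settings are site-specific and would need to be added through `update_parameters`.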

Hi @guillaumeeb,

Thanks for your reply! I suspected this would be the case, but it is good to know for certain. I have now implemented similar behavior with submitit :)

Thanks again for your help!
