Jobqueue : Workers not killed on time

outerspacesloth · December 14, 2021, 1:27pm

Hello all,

I’m currently using Dask to perform some data processing with Fourier analysis on a distributed Slurm Cluster

So the work basically consist In basically building a timeserie made of hundreds of HDF5 files into a single dask array. Then fourier transform the dask array, perform some algebra operations on it through matrices operations and finally do a SVD to save the singular values and vectors.

The Slurm Cluster is launched with jobs of 4 processes with 10 cores each and an adaptive scaling between 40 to 80 processes. Each job has a duration of 1h and --lifetime=45min, --stagger=2min.

See code below.

The issue I encounter is that at the end of the lifetime, the dask cluster does not release the workers to be soon killed nor does it generate workers to take the relay. Leading to a crash of the run.

My question being, what could possibly explain the behaviour ?

Best regards

cluster = SLURMCluster(cores=10,
processes=4,
queue = ‘cpu_p1’,
memory=‘140GiB’,
project=‘XXXXX’,
interface=‘ib0’,
walltime=‘01:00:00’,
death_timeout=3600,
job_extra=[“–job-name=dask_worker”,
“–hint=nomultithread”],
env_extra=[“module purge”, “module load python”, “set -x”],
extra=[“–lifetime”, “45m”, “–lifetime-stagger”, “2m”]
)

guillaumeeb · December 15, 2021, 12:48pm

Hi @outerspacesloth,

I think there is an issue in dask-jobqueue quite close to what you mention: cluster.adapt() and persist not working together · Issue #518 · dask/dask-jobqueue · GitHub. I did not have the time to try to reproduce and understand this yet.

Did you find something useful in the workers logs that don’t get killed?

One other small remark: SLURMCluster(cores=10, processes=4) means 4 workers process that will share 4 core, so 2.5 each (not sure if possible ). But it does not mean

jobs of 4 processes with 10 cores

And final and perhaps most important question: do you really need adaptive scaling?
You want to scale from 40 to 80 processes? 40 is quite a big number to keep at minimum. Is this interactive work, or do you just want to have small job walltime to get resources more quickly and renew them? Maybe the easier, if you launch this data processing as a batch, would be to increase the jobs walltime and not to use adapt.

outerspacesloth · December 19, 2021, 4:01pm

Hello,

Thank you for your feedback ! By looking at the workers logs I did not find much information about the crash. But looking at the scheduler logs just indicated to me that the workers could not be released due to memory use.

As you mentioned, I chose to go for a longer batch time and it solved the primary issue about worker release. As for the workers number, the database is quite big (tens of To) and I wanted some speed up

The number of cores was actually a typo, it’s (cores=40, processes=4) in my script.

Cheers and thank you again!

guillaumeeb · January 4, 2022, 3:40pm

Happy to help!

About this part, I just wanted to precise: don’t use adaptive mode if you don’t want worker that restart. It is often simpler to just use a longer walltime, and then scale method:

cluster = SLURMCluster(cores=40,
    processes=4,
    walltime=‘12:00:00’)
cluster.scale(80) # instead of adapt()

This way, the workers will be allocated as soon as ressources become available on your cluster.

Cheers.

Topic		Replies	Views
dask_jobqueue.SLURMCluster: multi-threaded workloads and the effect of setting "cores" Distributed dask-jobqueue , distributed	2	258	November 9, 2023
Memory allocation always <= 4GiB for distributed SLURMCluster workers Distributed dask-jobqueue , worker , distributed	8	733	July 12, 2022
Restarting workers on Slurm Cluster Distributed dask-jobqueue	4	203	September 26, 2023
SLURMCluster: close if close to walltime Deploying Dask	1	254	January 26, 2023
Ensuring Each Dask Task Starts on a New SLURM Job with a Limit of 5 Concurrent Jobs Distributed distributed	2	201	October 27, 2023

Jobqueue : Workers not killed on time

Related topics