I’m currently using Dask to perform some data processing with Fourier analysis on a distributed Slurm cluster.
The work basically consists in assembling a time series spread over hundreds of HDF5 files into a single dask array. I then Fourier transform the dask array, perform some algebra on it through matrix operations, and finally do an SVD to save the singular values and vectors.
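For reference, here is a minimal sketch of that pipeline on synthetic data; in the real run the array would be built from the HDF5 files (e.g. h5py datasets wrapped with da.from_array and stacked), and the shapes, chunking and algebra step shown here are purely illustrative:

```python
import dask.array as da

# Synthetic stand-in for the real dataset: in practice this array would be
# assembled from hundreds of HDF5 files. Shapes and chunk sizes are illustrative.
ts = da.random.random((1000, 64), chunks=(1000, 16))  # time x channels

# Fourier transform along time (da.fft.fft requires a single chunk on that axis)
spectrum = da.fft.fft(ts, axis=0)

# Example algebra step: a cross-spectral-like matrix via a matrix product
cross = da.absolute(spectrum.conj().T @ spectrum)

# SVD of the (small) result; rechunk to a single chunk so da.linalg.svd accepts it
u, s, v = da.linalg.svd(cross.rechunk(-1))
singular_values = s.compute()
```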
The Slurm cluster is launched with jobs of 4 processes with 10 cores each, and adaptive scaling between 40 and 80 processes. Each job has a duration of 1h, with --lifetime=45min and --lifetime-stagger=2min.
See code below.
The issue I encounter is that at the end of the lifetime, the Dask cluster neither releases the workers that are about to be killed nor spawns new workers to take over, which leads to a crash of the run.
My question is: what could explain this behaviour?
cluster = SLURMCluster(cores=10,
                       processes=4,
                       queue='cpu_p1',
                       walltime='1:00:00',
                       env_extra=["module purge", "module load python", "set -x"],
                       extra=["--lifetime", "45m", "--lifetime-stagger", "2m"])
cluster.adapt(minimum=40, maximum=80)
I think there is an issue in dask-jobqueue quite close to what you describe: cluster.adapt() and persist not working together · Issue #518 · dask/dask-jobqueue · GitHub. I have not had time to try to reproduce and understand it yet.
Did you find anything useful in the logs of the workers that don’t get killed?
One other small remark: SLURMCluster(cores=10, processes=4) means 4 worker processes that will share the 10 cores, so 2.5 cores each (not sure that is even possible). It does not mean
jobs of 4 processes with 10 cores each
And a final, and perhaps most important, question: do you really need adaptive scaling?
You want to scale from 40 to 80 processes? 40 is quite a big minimum to keep alive. Is this interactive work, or do you just want a small job walltime to get resources more quickly and renew them? If you launch this data processing as a batch job, the easiest option might be to increase the job walltime and not use adapt at all.
Thank you for your feedback! Looking at the worker logs, I did not find much information about the crash, but the scheduler logs indicated that the workers could not be released due to memory use.
As you suggested, I went for a longer batch time and it solved the primary issue with worker release. As for the number of workers, the dataset is quite big (tens of TB) and I wanted some speed-up.
The number of cores was actually a typo, it’s (cores=40, processes=4) in my script.
Cheers and thank you again!
Happy to help!
About this part, I just want to clarify: don’t use adaptive mode if you don’t want workers to restart. It is often simpler to use a longer walltime together with the scale method:
cluster = SLURMCluster(cores=40, processes=4)  # with a longer walltime, no --lifetime
cluster.scale(80)  # instead of adapt()
This way, the workers will be allocated as soon as resources become available on your cluster.