Hello. I am using the concurrent futures interface with Dask. I have long-running tasks running in Dask on top of Slurm, and I am using the adapt interface to scale my workload.
The one thing I don't like is that Dask seems to kill tasks as part of rebalancing.
This is not good, as it kills tasks that had been running for hours and restarts them from scratch. Is it possible to use adapt() to auto-scale up and down as the cluster allows and the workload needs, but at the same time force Dask not to kill running tasks and instead let them run to completion?
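For context, my setup looks roughly like the sketch below; the resource numbers, the adapt limits, and the `long_running_task` / `cases` names are just placeholders for my real workload:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; the real job script requests what each task needs.
cluster = SLURMCluster(cores=4, processes=1, memory="16GB", walltime="06:00:00")
cluster.adapt(minimum=0, maximum=32)  # grow/shrink the Slurm allocation with the workload

client = Client(cluster)
# long_running_task / cases stand in for the real multi-hour work items
futures = [client.submit(long_running_task, case) for case in cases]
results = client.gather(futures)
```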
Thanks
Hi @M1Sports20,
Is it occurring upon scaling up or down? If up, the easiest solution is probably to disable the work-stealing mechanism. For long-running tasks, you might also want to modify the queuing on the worker side and set worker-saturation to 1.0.
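If it helps, here is a minimal sketch of how those two settings could be applied through the Dask config; do it before the cluster and scheduler are created so they pick the values up:

```python
import dask

# Stop the scheduler from moving queued tasks between workers,
# and don't oversubscribe workers with extra queued tasks.
dask.config.set({
    "distributed.scheduler.work-stealing": False,
    "distributed.scheduler.worker-saturation": 1.0,
})
```

The same keys can also live in your Dask YAML config file on the scheduler side.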
However, I'm a bit surprised that running tasks would be stopped by that.
Could it come from the Slurm jobs ending?
Thanks again for your reply.
I am sure it's not the Slurm jobs ending; most of my Slurm logs show a timestamp of five minutes or less at the time of cancellation.
Currently, for testing, I am only running 12 long-running tasks of about 4 hours each, yet I end up with 144 Slurm log files. I know I need to figure out how it's deciding what to run where. Really, I just want it to kick off one Slurm job per process; someday I will have tens of thousands of processes that I want to execute. Roughly what I'm aiming for is sketched below.
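(For what it's worth, this is the kind of one-worker-per-job mapping I have in mind; memory, walltime, and the adapt limit are placeholders.)

```python
from dask_jobqueue import SLURMCluster

# Aim: one single-threaded worker process per Slurm job,
# so each long-running task gets its own job.
cluster = SLURMCluster(
    cores=1,        # one thread per worker
    processes=1,    # one worker process per Slurm job
    memory="4GB",
    walltime="05:00:00",
)
cluster.adapt(minimum=0, maximum=12)  # one job per in-flight task, up to 12
```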
Thanks. I will look into how to enable more logging on the scheduler, or just read more of the Dask docs.
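For the logging part, what I have in mind is something like the snippet below, assuming the scheduler lives in the same Python process as my script (as it does with dask-jobqueue); otherwise I'd set the equivalent through Dask's logging config:

```python
import logging

# Raise the scheduler's log level so scaling and task-retirement decisions show up.
logging.getLogger("distributed.scheduler").setLevel(logging.DEBUG)
```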
Did you try the suggestions I gave you about work stealing and saturation?
Also, you are using Dask in kind of an edge case. I often suggest looking at submitit for things like that, even if Dask can do it.
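With submitit, each submitted call becomes its own Slurm job, which maps directly onto "one Slurm job per process". A rough sketch, where the log folder, partition, resources, and the `long_running_task` / `cases` names are placeholders:

```python
import submitit

executor = submitit.AutoExecutor(folder="submitit_logs")
# Placeholder resources; tune to your tasks.
executor.update_parameters(timeout_min=5 * 60, slurm_partition="main", cpus_per_task=1)

# Each submit() launches one Slurm job.
jobs = [executor.submit(long_running_task, case) for case in cases]
results = [job.result() for job in jobs]
```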
I did try the suggestions you gave me and it still seems to be going on; however, I don't have any concrete feedback yet, as I haven't seen exactly what is happening. Interesting about submitit. I will try to look at this more and get back to you.