Hello. I am using the concurrent futures interface with Dask. I have long-running tasks running in Dask on top of Slurm, and I am using the adapt interface to scale my workload.
The one thing I don't like is that Dask seems to kill tasks as part of rebalancing.
This is not good, as it kills tasks that had been running for hours and restarts them from scratch. Is it possible to use adapt() to auto-scale up and down as the cluster allows and the workload needs, but at the same time force Dask not to kill running tasks and instead let them run to completion?
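For context, my setup looks roughly like the sketch below; the resource numbers, the adapt limits, and the `long_running_task` / `cases` names are just placeholders for my real workload:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; the real job script requests what each task needs.
cluster = SLURMCluster(cores=4, processes=1, memory="16GB", walltime="06:00:00")
cluster.adapt(minimum=0, maximum=32)  # grow/shrink the Slurm allocation with the workload

client = Client(cluster)
# long_running_task / cases stand in for the real multi-hour work items
futures = [client.submit(long_running_task, case) for case in cases]
results = client.gather(futures)
```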
Thanks
Hi @M1Sports20,
Is it occurring upon scaling up or down? If up, the easiest solution is probably to disable the work-stealing mechanism. For long-running tasks, you might also want to modify the queuing on the worker side and set worker-saturation to 1.0.
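If it helps, here is a minimal sketch of how those two settings could be applied through the Dask config; do it before the cluster and scheduler are created so they pick the values up:

```python
import dask

# Stop the scheduler from moving queued tasks between workers,
# and don't oversubscribe workers with extra queued tasks.
dask.config.set({
    "distributed.scheduler.work-stealing": False,
    "distributed.scheduler.worker-saturation": 1.0,
})
```

The same keys can also live in your Dask YAML config file on the scheduler side.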
However, I'm a bit surprised that running tasks would be stopped by that.
Could it come from the Slurm jobs ending?
Thanks again for your reply.
I am sure it's not the Slurm jobs ending; most of my Slurm logs show a timestamp of five minutes or less at the time of cancellation.
Currently, for testing, I am only running 12 long-running tasks of about 4 hours each, yet I end up with 144 Slurm log files. I know I need to figure out how it's deciding what to run where. Really, I just want it to kick off one Slurm job per process; someday I will have tens of thousands of processes that I want to execute. Roughly what I'm aiming for is sketched below.
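(For what it's worth, this is the kind of one-worker-per-job mapping I have in mind; memory, walltime, and the adapt limit are placeholders.)

```python
from dask_jobqueue import SLURMCluster

# Aim: one single-threaded worker process per Slurm job,
# so each long-running task gets its own job.
cluster = SLURMCluster(
    cores=1,        # one thread per worker
    processes=1,    # one worker process per Slurm job
    memory="4GB",
    walltime="05:00:00",
)
cluster.adapt(minimum=0, maximum=12)  # one job per in-flight task, up to 12
```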
Thanks. I will look into how to enable more logging on the scheduler, or just read more of the Dask docs.
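For the logging part, what I have in mind is something like the snippet below, assuming the scheduler lives in the same Python process as my script (as it does with dask-jobqueue); otherwise I'd set the equivalent through Dask's logging config:

```python
import logging

# Raise the scheduler's log level so scaling and task-retirement decisions show up.
logging.getLogger("distributed.scheduler").setLevel(logging.DEBUG)
```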
Did you try the suggestions I gave you about work stealing and saturation?
Also, you are using Dask in kind of an edge case. I often suggest looking at submitit for things like that, even if Dask can do it.
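With submitit, each submitted call becomes its own Slurm job, which maps directly onto "one Slurm job per process". A rough sketch, where the log folder, partition, resources, and the `long_running_task` / `cases` names are placeholders:

```python
import submitit

executor = submitit.AutoExecutor(folder="submitit_logs")
# Placeholder resources; tune to your tasks.
executor.update_parameters(timeout_min=5 * 60, slurm_partition="main", cpus_per_task=1)

# Each submit() launches one Slurm job.
jobs = [executor.submit(long_running_task, case) for case in cases]
results = [job.result() for job in jobs]
```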
I did try the suggestions you gave me and it still seems to be going on; however, I don't have any concrete feedback yet, as I haven't seen exactly what is happening. Interesting about submitit. I will try to look at this more and get back to you.