Is there any way how to make scheduler resilient against restarts?

petrsynek · September 12, 2023, 2:09pm

Is there any way how make the scheduler data resilient against restarts? (e.g.: backing up to volume, Redis whatever) As of now when the scheduler restarts the daskJob becomes frozen consuming resources and doing nothing.

guillaumeeb · September 12, 2023, 5:35pm

Hi @petrsynek, welcome to Dask community!

As stated in the documentation:

The process containing the scheduler might die. There is currently no persistence mechanism to record and recover the scheduler state.
The workers and clients will all reconnect to the scheduler after it comes back online but records of ongoing computations will be lost.

There is some discussion on github about this:

petrsynek · September 13, 2023, 7:30am

Thanks a lot for the answer, I found the reference in the documentation (I was just wondering whether there is some custom way to do so).

Followup question: As the whole daskJob becomes unstable as a result, would not it be reasonable to mark the job as failed? (Currently, it leaves the daskJob including the whole cluster hanging which is a bug) What would be the correct way to do so in the current version?

guillaumeeb · September 13, 2023, 6:25pm

I’m not sure of what you mean by that. How Dask is supposed to know a submitted graph is unstable? Dask will mark a computation as failed on some cases, if some task failed to run 3 times in a row for example.

crusaderky · September 18, 2023, 10:16am

The client should mark the whole job as failed as soon as the TCP connection to the scheduler collapses. This is, deliberately, very resilient to temporary interruptions (the wifi of your laptop where the client is running can die but that won’t shut down the TCP channel).

If your scheduler process is hung but your scheduler host is healthy, then the computation will hang forever (because the Linux kernel is what’s keeping the TCP channel alive).

The scheduler process should not hang, ever. However there are a few known ways to do so - easiest one being to feed it several hundreds of thousands of tasks. Look at your scheduler host’s CPU monitor. If it’s solid at 100%, you have a problem there.

petrsynek · October 11, 2023, 10:18am

Just for anyone who encounters this, this issue was caused by using Joblib.Parallel with the dask backend.

As @crusaderky notes the dask raises a ConnectionLost error which should cause the termination of the job (unless someone does not wrap into a try-except block which ignores this exception - which joblib unfortunately does)

Topic		Replies	Views
How to make the worker fail the computation when memory limit is reached?	4	5096	March 31, 2022
Prototyping highly available scheduler Distributed scheduler , distributed	2	147	March 2, 2023
Cluster hangs when scheduler and a worker runs on the same machine Distributed distributed	2	306	June 8, 2022
General cause/scenarios for `worker-handle-scheduler-connection-broken` error Distributed dask-gateway , distributed	8	1190	November 3, 2023
Dask dashboard crashes but client is still running dashboard	7	28	February 7, 2025

Is there any way how to make scheduler resilient against restarts?

Related topics