Is there any way to make the scheduler resilient against restarts?

Is there any way to make the scheduler data resilient against restarts (e.g. by backing it up to a volume, Redis, or similar)? As of now, when the scheduler restarts, the daskJob becomes frozen, consuming resources and doing nothing.

Hi @petrsynek, welcome to Dask community!

As stated in the documentation:

> The process containing the scheduler might die. There is currently no persistence mechanism to record and recover the scheduler state.
> The workers and clients will all reconnect to the scheduler after it comes back online, but records of ongoing computations will be lost.

There is some discussion on GitHub about this:

Thanks a lot for the answer. I found the reference in the documentation; I was just wondering whether there is some custom way to do so.

Follow-up question: as the whole daskJob becomes unstable as a result, wouldn't it be reasonable to mark the job as failed? (Currently it leaves the daskJob, including the whole cluster, hanging, which is a bug.) What would be the correct way to do so in the current version?

I’m not sure what you mean by that. How is Dask supposed to know that a submitted graph is unstable? Dask will mark a computation as failed in some cases, for example if a task failed to run 3 times in a row.
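As a side note, that "failed 3 times" threshold is configurable. A minimal sketch, assuming `dask` (and `distributed`) are installed; the exact config key is `distributed.scheduler.allowed-failures`:

```python
import dask

# The scheduler gives up on a task after it has failed
# `allowed-failures` times (the default is 3). Raising it makes
# computations more tolerant of flaky tasks or dying workers.
dask.config.set({"distributed.scheduler.allowed-failures": 5})

print(dask.config.get("distributed.scheduler.allowed-failures"))  # 5

# Retries can also be requested per computation, e.g.:
# client.submit(my_func, retries=3)   # `client` and `my_func` are illustrative
```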

The client should mark the whole job as failed as soon as the TCP connection to the scheduler collapses. This is, deliberately, very resilient to temporary interruptions (the Wi-Fi on the laptop where the client is running can die, but that won’t shut down the TCP channel).

If your scheduler process is hung but your scheduler host is healthy, then the computation will hang forever (because the Linux kernel is what’s keeping the TCP channel alive).
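To make the kernel-level behaviour above concrete: TCP keepalive is what eventually tears down a connection to a dead peer whose host is gone, and it can be tuned on any socket. A stdlib sketch (not Dask-specific; the fine-grained knobs assume a Linux host, hence the guards):

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Ask the kernel to probe an idle TCP connection and drop it if the
    peer stops answering. Without this, a connection to a hung process on
    a healthy host can stay 'alive' indefinitely, as described above."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket tunables are Linux-specific.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero
sock.close()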

The scheduler process should never hang. However, there are a few known ways to make it do so, the easiest being to feed it several hundred thousand tasks. Look at your scheduler host’s CPU monitor: if it is pinned at 100%, you have a problem there.
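A common mitigation for that failure mode is to batch many tiny work units into fewer, larger tasks before submitting them. A minimal, Dask-agnostic sketch (`process_batch` and the batch size are illustrative):

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of up to `batch_size` items, so that
    N tiny work units become roughly N / batch_size scheduler tasks."""
    it = iter(items)
    while chunk := list(islice(it, batch_size)):
        yield chunk

# 500_000 work units -> 500 tasks of 1_000 units each.
batches = list(batched(range(500_000), 1_000))
print(len(batches))  # 500

# Each batch would then be submitted as a single task, e.g.:
# futures = [client.submit(process_batch, b) for b in batches]  # illustrative
```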

Just for anyone who encounters this: the issue was caused by using joblib.Parallel with the Dask backend.

As @crusaderky notes, Dask raises a ConnectionLost error, which should cause the job to terminate (unless someone wraps it in a try-except block that ignores the exception, which joblib unfortunately does).
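If you do wrap submissions in a broad try-except yourself, the safe pattern is to re-raise connection-level errors rather than swallow them. A sketch, with a placeholder `ConnectionLost` class standing in for whatever exception your Dask version actually raises:

```python
class ConnectionLost(Exception):
    """Placeholder for the connection-loss error the client raises;
    the real class lives in the Dask/distributed codebase."""

def run_with_retries(func, attempts=3):
    """Retry transient failures, but never swallow a lost scheduler
    connection -- propagate it so the job fails fast instead of hanging."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionLost:
            raise                       # the cluster is gone: fail the job
        except Exception:
            if attempt == attempts - 1:
                raise                   # out of retries

# A flaky function that succeeds on the second call:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ValueError("transient")
    return "ok"

print(run_with_retries(flaky))  # ok
```

The key point is the bare `raise` on `ConnectionLost`: a catch-all `except Exception` that retries or ignores it is exactly the behaviour that leaves the job hanging.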