Prototyping a highly available scheduler

I’ve been looking to prototype a replicated/highly-available scheduler with a primary scheduler and a few backups. The idea is that the primary continuously replicates its state to the backups, so that if the primary crashes, a backup scheduler can take over without restarting the computations. My main question is: which scheduler-side data structures would need to be replicated for the backups to know which executions have already happened in the task graph?
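
Roughly, the control flow I’m picturing on the backup side is something like the sketch below. All of the names here are placeholders for the prototype, none of this is an existing `distributed` API:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a snapshot before we assume the primary died

def backup_loop(receive_snapshot, promote_to_primary):
    """Hypothetical backup-side loop: keep the latest replicated state,
    and take over if the primary stops sending snapshots."""
    latest_state = None
    last_seen = time.monotonic()
    while True:
        snapshot = receive_snapshot(timeout=1.0)  # placeholder transport call
        if snapshot is not None:
            latest_state = snapshot
            last_seen = time.monotonic()
        elif time.monotonic() - last_seen > HEARTBEAT_TIMEOUT:
            # Primary presumed dead: rebuild scheduler state from the last
            # replicated snapshot and start serving workers/clients.
            promote_to_primary(latest_state)
            return
```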

Looking into the scheduler source code and the API docs, the scheduler keeps the state of every task in its “tasks dict”, and a log of all transitions that have happened is available in the transition log. Are there any other data structures that would need to be tracked to keep the backup schedulers consistent?
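
To make the question concrete, here is a minimal sketch of how I imagine mirroring those transitions as they happen, using a `SchedulerPlugin` and its `transition` hook. `send_to_backup` is a placeholder for whatever transport the prototype would use (it is not an existing API), and I may well be missing other state that also needs to be shipped:

```python
from distributed.diagnostics.plugin import SchedulerPlugin


class ReplicationPlugin(SchedulerPlugin):
    """Forward every task transition on the primary to a backup scheduler."""

    def __init__(self, scheduler, send_to_backup):
        self.scheduler = scheduler          # running Scheduler instance
        self.send_to_backup = send_to_backup  # placeholder transport callable

    def transition(self, key, start, finish, *args, **kwargs):
        # Called by the scheduler on every task state change,
        # e.g. "processing" -> "memory".
        self.send_to_backup({
            "key": key,
            "start": start,
            "finish": finish,
        })


# Attaching the plugin, assuming we have a reference to the running
# Scheduler object on the primary:
# scheduler.add_plugin(ReplicationPlugin(scheduler, send_to_backup))
```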

Hi @PatrikZhong, welcome to this forum!

Have you taken a look at Resilience · Issue #1072 · dask/distributed · GitHub? If you really want to implement such a mechanism, maybe you could revive the discussion there?

Okay, thank you, I’ll give it a shot over there!
