Prototyping a highly available scheduler

I’ve been looking to prototype a replicated/highly-available scheduler with a primary scheduler and a few backups. The idea is that the primary continuously replicates its state to the backups, so that if the primary crashes, a backup scheduler can take over without restarting the computations. My main question is: which scheduler-side data structures would need to be replicated for the backups to know which executions have already happened in the task graph?
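
Roughly, the control flow I’m picturing on the backup side is something like the sketch below. All of the names here are placeholders for the prototype, none of this is an existing `distributed` API:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a snapshot before we assume the primary died

def backup_loop(receive_snapshot, promote_to_primary):
    """Hypothetical backup-side loop: keep the latest replicated state,
    and take over if the primary stops sending snapshots."""
    latest_state = None
    last_seen = time.monotonic()
    while True:
        snapshot = receive_snapshot(timeout=1.0)  # placeholder transport call
        if snapshot is not None:
            latest_state = snapshot
            last_seen = time.monotonic()
        elif time.monotonic() - last_seen > HEARTBEAT_TIMEOUT:
            # Primary presumed dead: rebuild scheduler state from the last
            # replicated snapshot and start serving workers/clients.
            promote_to_primary(latest_state)
            return
```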

Looking into the scheduler source code and the API docs, the scheduler keeps the state of every task in its “tasks dict”, and a log of all transitions that have happened is available in the transition log. Are there any other data structures that would need to be tracked to keep the backup schedulers consistent?
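
To make the question concrete, here is a minimal sketch of how I imagine mirroring those transitions as they happen, using a `SchedulerPlugin` and its `transition` hook. `send_to_backup` is a placeholder for whatever transport the prototype would use (it is not an existing API), and I may well be missing other state that also needs to be shipped:

```python
from distributed.diagnostics.plugin import SchedulerPlugin


class ReplicationPlugin(SchedulerPlugin):
    """Forward every task transition on the primary to a backup scheduler."""

    def __init__(self, scheduler, send_to_backup):
        self.scheduler = scheduler          # running Scheduler instance
        self.send_to_backup = send_to_backup  # placeholder transport callable

    def transition(self, key, start, finish, *args, **kwargs):
        # Called by the scheduler on every task state change,
        # e.g. "processing" -> "memory".
        self.send_to_backup({
            "key": key,
            "start": start,
            "finish": finish,
        })


# Attaching the plugin, assuming we have a reference to the running
# Scheduler object on the primary:
# scheduler.add_plugin(ReplicationPlugin(scheduler, send_to_backup))
```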

Hi @PatrikZhong, welcome to this forum!

Have you taken a look at Resilience · Issue #1072 · dask/distributed · GitHub? If you really want to implement such a mechanism, maybe you could revive the discussion there?

Okay, thank you, I’ll give it a shot over there!
