I’ve been looking to prototype a replicated/highly-available scheduler with a primary scheduler and a few backups. The idea is that the primary continuously replicate its state to the backups so that in case the primary crashes, a backup scheduler can take over without having to restart the computations. My main question is what data structures from the scheduler-side would be needed to be replicated in order for backups to know which executions have happened in the task graph?
Looking into the source code for the scheduler and the API docs, the scheduler contains the states of all tasks in the “tasks dict”, and a log of all transitions that have happened can be found in the transition log. Are there any other data structures that would be necessary to keep a track of to keep the other schedulers consistent?