I used to have the default `worker-ttl: "600s"` in the distributed config. When these worker failures occurred, the dashboard showed a freeze near the end of a batch of tasks, as in the screenshot below, with 86/91 completed. Everything froze for 600s, then the worker was restarted (by `check_worker_ttl`) and things continued.
This freeze always occurred at about 95% completion, with only a few tasks left.
This symptom of a deadlock at the very end of a task batch has been reported in at least two other instances: here and here.
I now have `worker-ttl: "60s"`. When worker failures occur, I believe a single worker failure no longer causes a general freeze (the true deadlocks still occur, as reported in the original post).
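For anyone who wants to try the same change without editing the YAML file, here is a minimal sketch of setting the TTL from Python. It assumes the setting lives under the `distributed.scheduler.worker-ttl` config key and that the override is applied in the process that starts the scheduler, before the scheduler is created; setting it only on the client side of an already running cluster would presumably have no effect.

```python
import dask
from dask.utils import parse_timedelta

# Assumed config key for the scheduler-side worker TTL.
TTL_KEY = "distributed.scheduler.worker-ttl"

# Override the TTL only inside this block; a scheduler started here
# (for example via a local cluster) would read the shortened value.
with dask.config.set({TTL_KEY: "60s"}):
    ttl = dask.config.get(TTL_KEY)
    # Convert the duration string to seconds to sanity-check the value.
    ttl_seconds = parse_timedelta(ttl)
    print(f"worker-ttl = {ttl} ({ttl_seconds} seconds)")
```

A shorter TTL only reduces how long the scheduler waits before giving up on an unresponsive worker; it does not address why the worker stops responding in the first place.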
- Could the deadlock be caused by dask/distributed/#8616? If so, I could perhaps simply upgrade from my current py3.11 to py3.12.