Search for deadlock cause | freeze upon data read; distributed; other?

templiert · August 19, 2024, 8:39pm

I used to have the default worker-ttl: "600s" in the distributed config. When these worker failures occurred, the dashboard showed a freeze near the end of a batch of tasks, as in the screenshot below, with 86/91 completed. Everything froze for 600s, then the worker was restarted (by check_worker_ttl) and things continued.

This freeze always occurred at around 95% completion, with only a few tasks left.

I see that this symptom, a deadlock at the very end of a task batch, has been reported in at least two other instances: here and here.

I now have worker-ttl: "60s". When worker failures occur, I believe a single worker failure no longer causes a general freeze (the true deadlocks still occur, as reported in the original post).
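
For anyone following along, here is a minimal sketch of where this setting lives, assuming the standard nesting under distributed.scheduler in a distributed.yaml config file (the file location and the rest of the config depend on your setup; only the worker-ttl value comes from this post):

```yaml
# Sketch only: assumes the usual distributed.yaml layout.
distributed:
  scheduler:
    # Time after which the scheduler treats a silent worker as lost.
    # Was "600s" before; lowered to shorten the freeze after a worker failure.
    worker-ttl: "60s"
```

A shorter value only reduces how long a stalled worker blocks the remaining tasks; it does not address whatever makes the worker stall in the first place.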

Could the deadlock be caused by dask/distributed/#8616? If so, I could perhaps simply upgrade from my current py3.11 to py3.12.
