Dask scheduler lost connection to high workload worker

ubw218 · March 19, 2022, 5:58pm

I have a delayed function that uses all the available threads on a worker. The workload is heavy and won’t finish for hours. I notice that while it is running, the scheduler keeps getting errors like the one below, and any print() output won’t make its way to the worker log until the whole function finishes.

I’m wondering if there’s a way to have a dedicated thread on each worker that handles basic heartbeats and other work (like tasks added from client.submit())?

2022-03-19 17:48:06,849+0000 ERROR [MainThread] distributed.core: Exception while handling op broadcast
Traceback (most recent call last):
  File "/dependencies/lib/python3.8/site-packages/distributed/comm/core.py", line 284, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/dependencies/lib/python3.8/site-packages/distributed/core.py", line 521, in handle_comm
    result = await result
  File "/dependencies/lib/python3.8/site-packages/distributed/scheduler.py", line 6020, in broadcast
    results = await All(
  File "/dependencies/lib/python3.8/site-packages/distributed/utils.py", line 208, in All
    result = await tasks.next()
  File "/dependencies/lib/python3.8/site-packages/distributed/scheduler.py", line 6012, in send_message
    comm = await self.rpc.connect(addr)
  File "/dependencies/lib/python3.8/site-packages/distributed/core.py", line 1071, in connect
    raise exc
  File "/dependencies/lib/python3.8/site-packages/distributed/core.py", line 1055, in connect
    comm = await fut
  File "/dependencies/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://172.27.0.19:37745 after 30 s

pavithraes · March 21, 2022, 5:58pm

@ubw218 Thanks for your question! Could you please share a minimal, reproducible example? It’ll allow us to help you better.
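
In the meantime, here is a standard-library-only sketch of one way a process can stop answering connection requests while it is busy. It is not Dask's implementation and may not be what is happening on your worker; every name in it (`start_loop_in_thread`, `answer_ping`, `run_blocking`, `run_offloaded`, `responsive_while`) is hypothetical. It assumes the busy work keeps an event loop from running, which is modelled here with `time.sleep` called directly inside a coroutine, and compares that with handing the same call to a thread pool.

```python
import asyncio
import concurrent.futures
import threading
import time


def start_loop_in_thread():
    """Run an event loop in a background thread (stand-in for a comm loop)."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop


async def answer_ping():
    """Stand-in for answering a heartbeat or connection request."""
    return "pong"


async def run_blocking(seconds):
    # Blocks the event loop itself: nothing else on this loop runs meanwhile.
    time.sleep(seconds)


async def run_offloaded(seconds):
    # Hands the blocking call to a thread pool so the loop stays free.
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, time.sleep, seconds)


def ping(loop, timeout):
    """Return True if the loop answers within `timeout` seconds."""
    future = asyncio.run_coroutine_threadsafe(answer_ping(), loop)
    try:
        return future.result(timeout=timeout) == "pong"
    except concurrent.futures.TimeoutError:
        return False


def responsive_while(loop, task, busy_seconds=1.0, ping_timeout=0.3):
    """Start `task` on the loop, ping the loop while it runs, report the result."""
    work = asyncio.run_coroutine_threadsafe(task(busy_seconds), loop)
    time.sleep(0.1)  # give the task time to start
    answered = ping(loop, ping_timeout)
    work.result()  # wait for the task to finish before returning
    return answered


loop = start_loop_in_thread()
blocked_ok = responsive_while(loop, run_blocking)
offloaded_ok = responsive_while(loop, run_offloaded)
print("blocking task, ping answered:", blocked_ok)
print("offloaded task, ping answered:", offloaded_ok)
loop.call_soon_threadsafe(loop.stop)
```

In the first case the ping times out because the loop cannot run until the blocking call returns; in the second the loop stays free and the ping is answered. Whether your delayed function starves the worker in a comparable way (for example by holding the GIL for long stretches) is something a reproducible example would help confirm.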
