I have a delayed function that uses all the available threads on a worker. The workload is heavy and won’t finish for hours. While it is running, I notice that the scheduler keeps logging errors like the one below, and any print() output doesn’t make its way to the worker log until the whole function finishes.
I’m wondering if there’s a way to have a dedicated thread on each worker that handles the basic heartbeat and other work (like tasks added via client.submit())?
2022-03-19 17:48:06,849+0000 ERROR [MainThread] distributed.core: Exception while handling op broadcast
Traceback (most recent call last):
  File "/dependencies/lib/python3.8/site-packages/distributed/comm/core.py", line 284, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/dependencies/lib/python3.8/site-packages/distributed/core.py", line 521, in handle_comm
    result = await result
  File "/dependencies/lib/python3.8/site-packages/distributed/scheduler.py", line 6020, in broadcast
    results = await All(
  File "/dependencies/lib/python3.8/site-packages/distributed/utils.py", line 208, in All
    result = await tasks.next()
  File "/dependencies/lib/python3.8/site-packages/distributed/scheduler.py", line 6012, in send_message
    comm = await self.rpc.connect(addr)
  File "/dependencies/lib/python3.8/site-packages/distributed/core.py", line 1071, in connect
    raise exc
  File "/dependencies/lib/python3.8/site-packages/distributed/core.py", line 1055, in connect
    comm = await fut
  File "/dependencies/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://172.27.0.19:37745 after 30 s
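
To make the question concrete, here is a minimal sketch of the behaviour I am after, written in plain Python rather than Dask. It is only an analogy: `heavy_task`, `small_task`, and `run_demo` are made-up names, and it assumes the worker behaves like a fixed-size thread pool, which may not match how distributed actually handles heartbeats internally.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def heavy_task(release):
    # Stand-in for the long-running delayed function: blocks until released.
    release.wait()
    return "heavy done"


def small_task():
    # Stand-in for lightweight work such as something sent via client.submit().
    return "small done"


def run_demo(n_threads=2):
    release = threading.Event()
    main_pool = ThreadPoolExecutor(max_workers=n_threads)  # the "worker threads"
    side_pool = ThreadPoolExecutor(max_workers=1)          # the "dedicated thread"

    # Occupy every thread in the main pool with heavy work.
    heavy = [main_pool.submit(heavy_task, release) for _ in range(n_threads)]

    # Small work sent to the saturated pool has to wait in the queue...
    queued = main_pool.submit(small_task)
    # ...while the same work on the dedicated thread runs right away.
    dedicated_result = side_pool.submit(small_task).result(timeout=5)
    queued_done_while_busy = queued.done()

    # Let the heavy work finish so the queued task can finally run.
    release.set()
    queued_result = queued.result(timeout=5)
    heavy_results = [f.result(timeout=5) for f in heavy]

    main_pool.shutdown()
    side_pool.shutdown()
    return {
        "dedicated_result": dedicated_result,
        "queued_done_while_busy": queued_done_while_busy,
        "queued_result": queued_result,
        "heavy_results": heavy_results,
    }


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the small task on the saturated pool only completes after the heavy tasks are released, while the one on the separate single-thread pool completes immediately. That second behaviour is what I would like for heartbeats and small submitted tasks on each worker.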