I’m also findind something similar (see traceback below). Sadly it’s not easy to create an MCVE probably because it’s a complex function and often happens on remote machines e.g. an EC2 machine and in my case it’s a shared jupyterhub instance.
Would it be possible to elaborate in Why did my worker die? — Dask.distributed 2022.8.1 documentation for CommClosedError?
2022-08-19 02:50:30,752 - distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/comm/tcp.py", line 223, in read
frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/worker.py", line 1159, in heartbeat
response = await retry_operation(
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/utils_comm.py", line 383, in retry_operation
return await retry(
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/utils_comm.py", line 368, in retry
return await coro()
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/core.py", line 1154, in send_recv_from_rpc
return await send_recv(comm=comm, op=key, **kwargs)
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/core.py", line 919, in send_recv
response = await comm.read(deserializers=deserializers)
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/comm/tcp.py", line 239, in read
convert_stream_closed_error(self, e)
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:35462 remote=tcp://127.0.0.1:40265>: Stream is closed
2022-08-19 02:50:30,805 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'tcp://127.0.0.1:34603'.
2022-08-19 02:50:30,805 - distributed.worker - WARNING - Heartbeat to scheduler failed
Traceback (most recent call last):
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/comm/tcp.py", line 223, in read
frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/worker.py", line 1159, in heartbeat
response = await retry_operation(
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/utils_comm.py", line 383, in retry_operation
return await retry(
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/utils_comm.py", line 368, in retry
return await coro()
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/core.py", line 1154, in send_recv_from_rpc
return await send_recv(comm=comm, op=key, **kwargs)
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/core.py", line 919, in send_recv
response = await comm.read(deserializers=deserializers)
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/comm/tcp.py", line 239, in read
convert_stream_closed_error(self, e)
File "/opt/userenvs/ray.bell/env/lib/python3.9/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:35484 remote=tcp://127.0.0.1:40265>: Stream is closed
2022-08-19 02:50:30,812 - distributed.worker - ERROR - Scheduler was unaware of this worker 'tcp://127.0.0.1:34603'. Shutting down.