Hi,
I am experiencing deadlocks with a python application using, among others, dask.distributed, xarray, zarr, SMB storage.
The application runs continuously with a PC reboot every ~24h. It has run OK for several months with dask 2023.6.0.
Updates were then made and dask/xarray/distributed updated to the latest stable in 07/2024. Since then I have experienced deadlocks and I cannot trace them to a particular change.
After suspecting many components of the application, a deadlock recently occurred in a rare period while the system was doing almost nothing: checking the content of a text file on SMB storage every 30s (with a proper context manager with path.open("r")
). This deadlock happened several hours after the main application performed its usual tasks, in the same python session (I cannot rule out that something bad occurred at that time leading to the deadlock). I think that this deadlock occurrence still narrows down the issue a lot.
My only way to break the deadlocks is to Ctrl-C the console running the python main, which always gives this traceback (and stops the application):
Traceback
ERROR 2024-08-17T07:44:21.081 | 11972 | Dask Worker process (from Nanny) | MainThread | 11968 | utils | distributed.worker | __exit__ | 862 | (<class 'asyncio.exceptions.CancelledError'>, CancelledError(), <traceback object at 0x0000020C6073F640>) | None
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\compatibility.py", line 204, in asyncio_run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 691, in run_until_complete
self.run_forever()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 658, in run_forever
self._run_once()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 2070, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\worker.py", line 1558, in close
await r.close_gracefully(reason=reason)
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1251, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1479, in connect
return await self._connect(addr=addr, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1423, in _connect
comm = await connect(
^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\core.py", line 377, in connect
handshake = await comm.read()
^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
ERROR 2024-08-17T07:44:21.080 | 11724 | Dask Worker process (from Nanny) | MainThread | 11728 | utils | distributed.worker | __exit__ | 862 | (<class 'asyncio.exceptions.CancelledError'>, CancelledError(), <traceback object at 0x00000290266AF100>) | None
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\compatibility.py", line 204, in asyncio_run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 691, in run_until_complete
self.run_forever()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 658, in run_forever
self._run_once()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 2070, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\worker.py", line 1558, in close
await r.close_gracefully(reason=reason)
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1251, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1479, in connect
return await self._connect(addr=addr, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1423, in _connect
comm = await connect(
^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\core.py", line 377, in connect
handshake = await comm.read()
^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
ERROR 2024-08-17T07:44:21.081 | 2228 | Dask Worker process (from Nanny) | MainThread | 2148 | utils | distributed.worker | __exit__ | 862 | (<class 'asyncio.exceptions.CancelledError'>, CancelledError(), <traceback object at 0x000001AE2FB7E0C0>) | None
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\compatibility.py", line 204, in asyncio_run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 691, in run_until_complete
self.run_forever()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 658, in run_forever
self._run_once()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 2070, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\worker.py", line 1558, in close
await r.close_gracefully(reason=reason)
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1251, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1479, in connect
return await self._connect(addr=addr, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1423, in _connect
comm = await connect(
^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\core.py", line 377, in connect
handshake = await comm.read()
^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
ERROR 2024-08-17T07:44:21.082 | 9744 | Dask Worker process (from Nanny) | MainThread | 1372 | utils | distributed.worker | __exit__ | 862 | (<class 'asyncio.exceptions.CancelledError'>, CancelledError(), <traceback object at 0x0000027015E72940>) | None
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\compatibility.py", line 204, in asyncio_run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 691, in run_until_complete
self.run_forever()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 658, in run_forever
self._run_once()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 2070, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\worker.py", line 1558, in close
await r.close_gracefully(reason=reason)
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1251, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1479, in connect
return await self._connect(addr=addr, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1423, in _connect
comm = await connect(
^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\core.py", line 377, in connect
handshake = await comm.read()
^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
ERROR 2024-08-17T07:44:21.083 | 1672 | Dask Worker process (from Nanny) | MainThread | 1668 | utils | distributed.worker | __exit__ | 862 | (<class 'asyncio.exceptions.CancelledError'>, CancelledError(), <traceback object at 0x0000026ECE076800>) | None
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\compatibility.py", line 204, in asyncio_run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 691, in run_until_complete
self.run_forever()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 658, in run_forever
self._run_once()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 2070, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\worker.py", line 1558, in close
await r.close_gracefully(reason=reason)
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1251, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1479, in connect
return await self._connect(addr=addr, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1423, in _connect
comm = await connect(
^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\core.py", line 377, in connect
handshake = await comm.read()
^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
ERROR 2024-08-17T07:44:21.084 | 6832 | Dask Worker process (from Nanny) | MainThread | 1612 | utils | distributed.worker | __exit__ | 862 | (<class 'asyncio.exceptions.CancelledError'>, CancelledError(), <traceback object at 0x0000016CDB987940>) | None
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\compatibility.py", line 204, in asyncio_run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 691, in run_until_complete
self.run_forever()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 658, in run_forever
self._run_once()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 2070, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\worker.py", line 1558, in close
await r.close_gracefully(reason=reason)
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1251, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1479, in connect
return await self._connect(addr=addr, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1423, in _connect
comm = await connect(
^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\core.py", line 377, in connect
handshake = await comm.read()
^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
ERROR 2024-08-17T07:44:21.090 | 11880 | Dask Worker process (from Nanny) | MainThread | 11872 | utils | distributed.worker | __exit__ | 862 | (<class 'asyncio.exceptions.CancelledError'>, CancelledError(), <traceback object at 0x0000015867EB5240>) | None
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\compatibility.py", line 204, in asyncio_run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 691, in run_until_complete
self.run_forever()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 658, in run_forever
self._run_once()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 2070, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\worker.py", line 1558, in close
await r.close_gracefully(reason=reason)
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1251, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1479, in connect
return await self._connect(addr=addr, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1423, in _connect
comm = await connect(
^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\core.py", line 377, in connect
handshake = await comm.read()
^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
ERROR 2024-08-17T07:44:21.092 | 11592 | Dask Worker process (from Nanny) | MainThread | 11636 | utils | distributed.worker | __exit__ | 862 | (<class 'asyncio.exceptions.CancelledError'>, CancelledError(), <traceback object at 0x000001BABFD1B480>) | None
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\compatibility.py", line 204, in asyncio_run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 691, in run_until_complete
self.run_forever()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 658, in run_forever
self._run_once()
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\base_events.py", line 2070, in _run_once
event_list = self._selector.select(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 323, in select
r, w, _ = self._select(self._readers, self._writers, [], timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\selectors.py", line 314, in _select
r, w, x = select.select(r, w, w, timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\asyncio\runners.py", line 157, in _on_sigint
raise KeyboardInterrupt()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\utils.py", line 837, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\worker.py", line 1558, in close
await r.close_gracefully(reason=reason)
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1251, in send_recv_from_rpc
comm = await self.pool.connect(self.addr)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1479, in connect
return await self._connect(addr=addr, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\core.py", line 1423, in _connect
comm = await connect(
^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\core.py", line 377, in connect
handshake = await comm.read()
^^^^^^^^^^^^^^^^^
File "C:\Users\myuser\AppData\Local\anaconda3\Lib\site-packages\distributed\comm\tcp.py", line 225, in read
frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
asyncio.exceptions.CancelledError
- this simple text file reading task was not performed when all the previous deadlocks occurred
- in the application there is always a distributed process-based client running (8 processes with 4 threads each, 32 core PC)
- the deadlocks typically occur after 2-10 hours of the main application running but still seem random: yesterday it ran for 24 hours without deadlock, today 3 deadlocks already occurred.
- after activating all possible distributed logging on the client and workers, I only found one warning that apparently is not responsible for the deadlock, see here.
xr.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (âEnglish_United Statesâ, â1252â)
libhdf5: 1.12.1
libnetcdf: None
xarray: 2024.6.0
pandas: 2.0.3
numpy: 1.24.3
scipy: 1.13.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: 3.9.0
zarr: 2.18.2
cftime: None
nc_time_axis: None
iris: None
bottleneck: 1.3.5
dask: 2024.7.1
distributed: 2024.7.1
matplotlib: 3.8.2
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2024.6.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 24.1.2
conda: 23.9.0
pytest: 7.4.0
mypy: None
IPython: 8.15.0
sphinx: 5.0.2
The main application runs about 100k tasks per hour: mostly analyzing images saved on SMB storage and storing numeric results in a xarray dataset saved in .zarr format on SMB storage.
Less than 10x/ hour there are occurrences of Worker failed to heartbeat for 83s; attempting restart
. To inspect the tasks of the dying worker, I added a log statement of the worker.processing
in check_worker_ttl here in the Scheduler code. There are usually about 4-6 tasks. The only common feature I could find among the hanging tasks is that they read data from SMB storage. I added many try/except in these tasks which do not flag anything. I am currently imagining a sort of freeze upon entering a with path.open():
or PIL.Image.open(path)
with files stored on SMB storage.
My current 2 hypotheses are:
- something freezes upon file read on SMB storage (I now added a logger.debug message before and after any file read, waiting for the next deadlock).
- distributed alone ends up deadlocking after some time
During a deadlock:
- the client dashboard is not available any more. I also cannot access the client from another python process when specifying the port address.
- the many heartbeating logging debug messages of distributed.core stop
- I once added some logging in a root asyncio or distributed loop (I think the one shown in the traceback), and I could see that the loop was alive during a deadlock when everything else was frozen.
One observation: most of my tasks are bound methods of agents originally defined using dataclasses. I have seen this older comment and this recent PR that maybe hint at some instability with dataclasses.
Questions
- does the deadlock traceback upon KeyboardInterrupt tell something that I missed?
- are the traceback and symptoms consistent with the hypothesis that my system sometimes freezes when reading data?
- is it otherwise possible that distributed simply deadlocks out of nowhere after several hours, and I am wrongly blaming interaction with storage?
- what can I do to locate the deadlock? should I register plugins on workers to log some specific events?
Thanks for any help.