Intermittent `CommClosedError` / worker–worker connection resets under load on HPC despite high bandwidth and increased timeouts

Problem description

I am running a Dask application on an HPC cluster (SLURM) that processes ~200 GB of Zarr data across 3 nodes. My cluster-creation logic spins up multiple Dask workers per node, giving each worker 4 threads and ~96 GB of RAM. Nodes have either 384 or 768 GB of RAM, and the number of workers per node is determined dynamically from the available memory.
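For context, the per-node worker count is derived roughly like this (a simplified sketch, not my exact code; the function name and constants are illustrative):

```python
# Simplified sketch of the dynamic worker sizing described above; the
# function name and constants are illustrative, not my actual code.
MEM_PER_WORKER_GB = 96   # target memory budget per worker
THREADS_PER_WORKER = 4   # threads per worker

def workers_per_node(node_mem_gb: int) -> int:
    """How many ~96 GB workers fit on a node with the given amount of RAM."""
    return max(1, node_mem_gb // MEM_PER_WORKER_GB)

# A 384 GB node gets 4 workers; a 768 GB node gets 8.
```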

My application completes successfully, but worker stdout frequently shows errors such as `Worker stream died during communication`, `CommClosedError`, and `Connection reset by peer`, often preceded by `Event loop was unresponsive in Worker for ~50s` (the duration varies, sometimes going up to ~100 s).

The environment

We use Spack as our package manager, so the versions we can install are limited to what is available as Spack packages. The relevant versions are:

- Python 3.11.11 (custom build)
- dask: 2025.7.0
- distributed: 2025.7.0
- numpy: 2.2.6

Observations and mitigations tried

Initially, under the same workload, workers were actually dying after a few such communication failures. I found a related discussion about running Dask on Summit (Scaling distributed dask to run 27,648 dask workers on the Summit supercomputer · Issue #3691 · dask/distributed · GitHub) where increasing communication timeouts helped with similar symptoms. I applied the following Dask configuration:

```yaml
distributed:
  comm:
    timeouts:
      connect: "600s"
      tcp: "900s"
  scheduler:
    worker-ttl: "15 minutes"
```
After this change, the application now runs to completion and workers no longer die, but the same communication errors still appear intermittently in the logs, though less frequently.

What I suspect

I have independently benchmarked node-to-node bandwidth (~9 GiB/s), and workers can reliably transfer 4 GB NumPy arrays (the largest size I tried) in isolation, so this does not appear to be raw network throughput or memory pressure. My current hypothesis is event-loop starvation on workers caused by long-running, GIL-holding work inside tasks (e.g. a few C++ libraries called through Python bindings that do not explicitly release the GIL), combined with multiple threads per worker. This would delay heartbeats and TCP servicing and could produce transient worker–worker connection resets.
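To illustrate the mechanism I suspect (this is not a reproduction of the Dask failure itself, just a minimal asyncio sketch), a blocking call on the same thread as an event loop starves a periodic "heartbeat" in exactly the way a GIL-holding task would starve a worker's communication loop:

```python
import asyncio
import time

async def main() -> float:
    beats: list[float] = []

    async def heartbeat() -> None:
        # Periodic tick, standing in for the worker's event-loop duties
        # (heartbeats, servicing TCP streams, etc.).
        while True:
            beats.append(time.monotonic())
            await asyncio.sleep(0.05)

    hb = asyncio.get_running_loop().create_task(heartbeat())
    await asyncio.sleep(0.15)   # heartbeat ticks normally for a while
    time.sleep(0.5)             # blocking work: the loop cannot run the heartbeat
    await asyncio.sleep(0.15)   # heartbeat resumes once the loop is free again
    hb.cancel()
    # Largest gap between consecutive heartbeat ticks
    return max(b - a for a, b in zip(beats, beats[1:]))

gap = asyncio.run(main())
print(f"worst heartbeat gap: {gap:.2f}s")  # ~0.5s despite a 0.05s interval
```

In a real worker the blocking span is a GIL-holding C++ call rather than `time.sleep`, but the effect on the loop is the same: nothing on that loop runs until the call returns.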

Reproducibility

I attempted to construct a minimal reproducible example outside of my full application (including synthetic workloads that intentionally hold the GIL and trigger worker–worker transfers), but I was not able to reliably reproduce the same CommClosedError / connection reset behavior in isolation. The errors seem to arise only under the full application workload, which makes me suspect an interaction between task scheduling, GIL-holding code paths, and communication under sustained load rather than a single isolated operation.

What I’m asking

I’m looking for guidance on:

  1. How to conclusively diagnose whether this is indeed GIL / event-loop starvation (as opposed to something else, such as Dask internals misbehaving under heavy load).
  2. Recommended worker/thread configurations for large, IO-heavy HPC workloads to avoid these communication instabilities.

Example log excerpts:

```
2026-02-18 13:08:49,744 - distributed.core - INFO - Event loop was unresponsive in Worker for 3.03s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2026-02-18 13:38:00,936 - distributed.worker - ERROR - Worker stream died during communication: tcp://10.192.61.105:36345
tornado.iostream.StreamClosedError: Stream is closed
    convert_stream_closed_error(self, e)
  File "/spack/install/linux-ubuntu22.04-x86_64_v3/none-none/py-distributed-2025.7.0-zkoleiuueexvzmukslghrmigfxx3tvv4/lib/python3.11/site-packages/distributed/comm/tcp.py", line 137, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://10.192.61.85:53954 remote=tcp://10.192.61.105:36345>: Stream is closed
2026-02-18 13:38:00,938 - distributed.worker - ERROR - Worker stream died during communication: tcp://10.192.61.105:36345
ConnectionResetError: [Errno 104] Connection reset by peer

2026-02-18 13:38:00,937 - distributed.worker - ERROR - Worker stream died during communication: tcp://10.192.61.105:36345
ConnectionResetError: [Errno 104] Connection reset by peer
    convert_stream_closed_error(self, e)
  File "/spack/install/linux-ubuntu22.04-x86_64_v3/none-none/py-distributed-2025.7.0-zkoleiuueexvzmukslghrmigfxx3tvv4/lib/python3.11/site-packages/distributed/comm/tcp.py", line 143, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://10.192.61.85:47574 remote=tcp://10.192.61.105:36345>: ConnectionResetError: [Errno 104] Connection reset by peer
2026-02-18 13:38:00,938 - distributed.worker - ERROR - Worker stream died during communication: tcp://10.192.61.105:36345
ConnectionResetError: [Errno 104] Connection reset by peer
```