Problem description
I am running a Dask application on an HPC cluster (SLURM) processing ~200 GB of Zarr data across 3 nodes. My cluster-creation logic spins up multiple Dask workers per node, giving each worker 4 threads and ~96 GB of RAM. Nodes have either 384 or 768 GB of RAM, and the number of workers per node is determined dynamically from the memory available on that node.
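The sizing logic is essentially the following (a simplified sketch; the function name and constants are illustrative, not the real code):

```python
RAM_GB_PER_WORKER = 96  # each worker gets ~96 GB and 4 threads

def workers_per_node(node_ram_gb: int, ram_per_worker_gb: int = RAM_GB_PER_WORKER) -> int:
    """How many workers fit on one node at ~96 GB each (at least one)."""
    return max(1, node_ram_gb // ram_per_worker_gb)

# 384 GB nodes get 4 workers, 768 GB nodes get 8
```

Each worker process is then launched with 4 threads and a memory limit matching its share (i.e. the `--nthreads` / `--memory-limit` options of the worker CLI).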
My application completes successfully, but worker stdout frequently shows errors such as `Worker stream died during communication`, `CommClosedError`, and `Connection reset by peer`, often preceded by `Event loop was unresponsive in Worker for ~50s` (the duration varies, sometimes up to ~100s).
The environment
We use Spack as our package manager, so we are limited to the versions available in the Spack package index, currently:
python: 3.11.11 (custom build)
dask: 2025.7.0
distributed: 2025.7.0
numpy: 2.2.6
Observations and mitigations tried
Initially, under the same workload, workers were actually dying after a few such communication failures. I found a related discussion about running Dask on Summit (Scaling distributed dask to run 27,648 dask workers on the Summit supercomputer · Issue #3691 · dask/distributed · GitHub) where increasing communication timeouts helped with similar symptoms. I applied the following Dask configuration:
```yaml
distributed:
  comm:
    timeouts:
      connect: "600s"
      tcp: "900s"
  scheduler:
    worker-ttl: "15 minutes"
```
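For completeness, I apply these in the SLURM batch script via environment variables (if I have read the Dask configuration docs correctly, `DASK_`-prefixed variables are mapped onto the config tree, with double underscores denoting nesting and single underscores treated as hyphens):

```shell
# Equivalent environment-variable form of the YAML config above
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="600s"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP="900s"
export DASK_DISTRIBUTED__SCHEDULER__WORKER_TTL="15 minutes"
```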
After this change, the application now runs to completion and workers no longer die, but the same communication errors still appear intermittently in the logs, though less frequently.
What I suspect
I have independently benchmarked node-to-node bandwidth (~9 GiB/s), and workers can reliably transfer 4 GB NumPy arrays (the largest I tried) in isolation, so this does not appear to be raw network throughput or memory pressure. My current hypothesis is event-loop starvation on the workers caused by long-running, GIL-holding work inside tasks (e.g. a few C++ libraries I call through Python bindings that do not explicitly release the GIL), combined with multiple threads per worker; this would delay heartbeats and TCP servicing and produce transient worker–worker connection resets.
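One empirical check I have used on a suspect binding (a stand-alone sketch; `gil_bound` below is a pure-Python stand-in for the C++ call) is to time one call serially and then two concurrent calls in threads: if the function released the GIL, the two threads would overlap and finish in roughly the single-call time, whereas GIL-holding code serializes and takes roughly twice as long.

```python
import threading
import time

def gil_bound():
    # Stand-in workload: pure-Python loops never release the GIL.
    s = 0
    for i in range(5_000_000):
        s += i

def wall_time(fn, n_threads):
    """Wall-clock seconds to run `fn` once in each of `n_threads` threads."""
    threads = [threading.Thread(target=fn) for _ in range(n_threads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

serial = wall_time(gil_bound, 1)
parallel = wall_time(gil_bound, 2)
# For GIL-holding code, `parallel` approaches 2 * serial rather than serial.
```

This distinguishes GIL-holding from GIL-releasing bindings, but as noted below it has not been enough to reproduce the communication errors themselves.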
Reproducibility
I attempted to construct a minimal reproducible example outside of my full application (including synthetic workloads that intentionally hold the GIL and trigger worker–worker transfers), but I was not able to reliably reproduce the same CommClosedError / connection reset behavior in isolation. The errors seem to arise only under the full application workload, which makes me suspect an interaction between task scheduling, GIL-holding code paths, and communication under sustained load rather than a single isolated operation.
What I’m asking
I’m looking for guidance on:
- how to conclusively diagnose whether this is indeed GIL / event-loop starvation (vs. something else, such as Dask internals under heavy load)
- recommended worker/thread configurations for large, IO-heavy HPC workloads to avoid these communication instabilities
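On the first point: the kind of measurement I had in mind is a stdlib-only loop-lag probe (a sketch; `max_loop_lag` is an illustrative name, not a Dask API), which records how late a periodic `asyncio` timer fires. On a healthy event loop the lag is near zero; a GIL-holding task elsewhere in the process inflates it toward the task's duration.

```python
import asyncio
import time

async def max_loop_lag(duration: float = 0.5, interval: float = 0.05) -> float:
    """Worst observed scheduling lag of the running loop over `duration` seconds.

    Each iteration asks the loop to wake us after `interval`; any extra
    delay beyond that is time the loop spent blocked or starved.
    """
    deadline = time.perf_counter() + duration
    worst = 0.0
    while time.perf_counter() < deadline:
        t0 = time.perf_counter()
        await asyncio.sleep(interval)
        worst = max(worst, time.perf_counter() - t0 - interval)
    return worst

lag = asyncio.run(max_loop_lag())
```

I am unsure how best to run something like this on the workers' own event loops under load (rather than a fresh loop), which is part of what I am asking about.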