Hello Dask Community!
I am facing two problems. We have a service that uses Dask DataFrame for its operations, and we regularly run large-dataset tests (1 million and 2 million records). We always used static mode: 4 workers with spilling to disk, a 2 GB Pod memory limit, and a 1 GB Dask worker memory limit, which always worked fine. Now we are trying to use adaptive mode deployed in a Kubernetes cluster. Both our Dask version and the Kubernetes operator are 2025.4.1.
When we execute a really CPU-heavy job with scaling from, for example, 2 to 6 workers, a worker dies because of the CPU load.
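For context, this is roughly how the 2-to-6 range is expressed with the Dask Kubernetes operator. This is a sketch: the cluster name `lm-cluster` is inferred from the worker names in the logs below, everything else should be adjusted to your setup.

```yaml
# Sketch of a DaskAutoscaler resource for the 2-6 worker range
# described above; the name must match the existing DaskCluster.
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: lm-cluster
spec:
  cluster: lm-cluster
  minimum: 2   # never scale below 2 workers
  maximum: 6   # never scale above 6 workers
```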
This is the log from the worker:
[INFO] Remove worker addr: tcp://10.1.80.131:40681 name: lm-cluster-default-worker-cf2b1454b5 (stimulus_id='handle-worker-cleanup-1754048275.9446254')
distributed.scheduler: scheduler.py:5512[ERROR] Task ('chunk-10cd24fe0994cd4a8ff6cd05dfe6c480', 31) marked as failed because 1 workers died while trying to run it
I fixed the issue by using 3 threads per worker and setting worker-ttl: 5m.
However, it doesn't make sense to me to set worker-ttl that high, since the worker should always have a thread available to send heartbeats to the scheduler. If I remove the setting, workers start crashing again. There is no way to say "please reserve 1 thread for scheduler communication only". Am I missing something in the configuration?
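For reference, the workaround above can be written as Dask configuration. This is a sketch: `distributed.scheduler.worker-ttl` is the real config key, and the 5m value is the one from my workaround.

```yaml
# Sketch of the worker-ttl workaround as Dask config, e.g. placed in a
# dask.yaml file or set via the environment variable
# DASK_DISTRIBUTED__SCHEDULER__WORKER_TTL on the scheduler pod.
distributed:
  scheduler:
    worker-ttl: 5m   # tolerate late heartbeats from busy workers
```

The threads-per-worker part of the workaround is set on the worker command line instead (`dask worker --nthreads 3`) or in the worker container args in the DaskCluster spec.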
Question 2:
Can someone please tell me what is going on: why is this worker being killed while it still has tasks processing and memory in use? It seems that a worker is trying to get data from another worker that is already dead.
LOG FROM WORKER:
[ERROR] failed during get data with tcp://10.1.80.185:42683 → tcp://10.1.80.137:46005
<TCP (closed) local=tcp://10.1.80.185:42683 remote=tcp://10.1.80.137:56020>: ConnectionResetError: [Errno 104] Connection reset by peer
[INFO] Lost connection to 'tcp://10.1.80.137:56020'
LOG FROM SCHEDULER:
[INFO] Worker status running → closing - <WorkerState 'tcp://10.1.80.137:46005', name: lm-cluster-default-worker-efa2a42e29, status: closing, memory: 59, processing: 3>
[INFO] Received 'close-stream' from tcp://10.1.80.137:40236; closing.
[INFO] Remove worker addr: tcp://10.1.80.137:46005 name: lm-cluster-default-worker-efa2a42e29 (stimulus_id='handle-worker-cleanup-1754401644.3424878')
[ERROR] Task ('add-b48bae854041d9f44435a6df4586833f', 72) marked as failed because 1 workers died while trying to run it
[ERROR] Task ('operation-1d6fd89bb12125bdea75ea6fd00a50cb', 70) marked as failed because 1 workers died while trying to run it
[WARNING] Removing worker 'tcp://10.1.80.137:46005' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('operation-011c3dbd3d6b2bbfb90c31f1ff44a05c', 14), ('operation-011c3dbd3d6b2bbfb90c31f1ff44a05c', 20), …
[DEBUG] Removed worker <WorkerState 'tcp://10.1.80.137:46005', name: lm-cluster-default-worker-efa2a42e29, status: closed, memory: 0, processing: 0>
Thank you in advance