KilledWorker error

I’m encountering the following error in a Coiled environment.

distributed.scheduler.KilledWorker: Attempted to run task (‘merge_chunk-baaaf31d5ffee5939965f604d435144f’, 0) on 4 different workers, but all those workers died while running it. The last worker that attempt to run the task was tls___. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see Why did my worker die? — Dask.distributed 2023.12.1 documentation.

I’m unable to decipher from the logs what the issue might be.

Hi @mihir, welcome to the Dask community!

Do you have the Scheduler and Worker logs? Do you have any other information in your main process, like warnings about restarting Workers?

A common cause of failure during a merge is running out of memory.
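
In case it’s useful, here is a minimal sketch of pulling logs from the client side (assuming a `Client` connected to your Coiled cluster; logs of workers that have already died usually have to be retrieved from the platform itself, e.g. the Coiled web UI):

```python
from distributed import Client

# `cluster` is assumed to be your existing Coiled cluster object;
# a plain local `Client()` would work the same way for demonstration.
client = Client(cluster)

# Most recent scheduler log entries as (level, message) pairs
for level, message in client.get_scheduler_logs():
    print(level, message)

# Logs of the currently running workers, keyed by worker address
for address, entries in client.get_worker_logs().items():
    print(address)
    for level, message in entries:
        print("  ", level, message)
```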

Hi @guillaumeeb, no warnings about restarting workers.
I do see that even though I have assigned 20 workers, only one worker is actively taking the load, with frequent “spilled to disk” events.

Here are some logs from the active worker:

2023-12-22 16:01:19,198 - distributed.worker.memory - WARNING - Worker is at 6% memory usage. Resuming worker. Process memory: 1.76 GiB – Worker memory limit: 28.53 GiB

2023-12-22 16:01:18,277 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see Worker Memory Management — Dask.distributed 2023.12.1+16.g81774d4 documentation for more information. – Unmanaged memory: 22.86 GiB – Worker memory limit: 28.53 GiB

2023-12-22 16:01:18,276 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker. Process memory: 22.86 GiB – Worker memory limit: 28.53 GiB

2023-12-22 16:01:08,237 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see Worker Memory Management — Dask.distributed 2023.12.1+16.g81774d4 documentation for more information. – Unmanaged memory: 20.06 GiB – Worker memory limit: 28.53 GiB


Logs from the scheduler:


2023-12-22 16:06:21,837 - distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.0.38:33783

2023-12-22 16:06:21,836 - distributed.scheduler - INFO - Register worker <WorkerState ‘tls://10.0.0.38:33783’, name: mihir-gpu-worker-16149ad88e, status: running, memory: 0, processing: 0>

2023-12-22 16:06:20,608 - distributed.scheduler - INFO - Close client connection: Client-b04a3886-a0e2-11ee-860f-96156fc037e1

2023-12-22 16:06:20,607 - distributed.scheduler - INFO - Remove client Client-b04a3886-a0e2-11ee-860f-96156fc037e1

2023-12-22 16:06:20,607 - distributed.scheduler - INFO - Remove client Client-b04a3886-a0e2-11ee-860f-96156fc037e1

2023-12-22 16:06:20,338 - distributed.scheduler - INFO - Task (‘set_index-index-050e7007f9f827770b99160395ed8787’, 0) marked as failed because 4 workers died while trying to run it

2023-12-22 16:06:20,338 - distributed.scheduler - INFO - Remove worker <WorkerState ‘tls://10.0.0.38:36811’, name: mihir-gpu-worker-16149ad88e, status: paused, memory: 6, processing: 1> (stimulus_id=‘handle-worker-cleanup-1703261180.3383474’)

2023-12-22 16:05:00,091 - distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.0.30:45695

2023-12-22 16:05:00,090 - distributed.scheduler - INFO - Register worker <WorkerState ‘tls://10.0.0.30:45695’, name: mihir-gpu-worker-c46dfdec3c, status: running, memory: 0, processing: 0>

2023-12-22 16:04:59,009 - distributed.scheduler - INFO - Remove worker <WorkerState ‘tls://10.0.0.30:46585’, name: mihir-gpu-worker-c46dfdec3c, status: paused, memory: 6, processing: 1> (stimulus_id=‘handle-worker-cleanup-1703261099.0089679’)

2023-12-22 16:03:49,622 - distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.0.10:34177

2023-12-22 16:03:49,622 - distributed.scheduler - INFO - Register worker <WorkerState ‘tls://10.0.0.10:34177’, name: mihir-gpu-worker-dac61e5776, status: running, memory: 0, processing: 0>

2023-12-22 16:03:48,556 - distributed.scheduler - INFO - Remove worker <WorkerState ‘tls://10.0.0.10:34025’, name: mihir-gpu-worker-dac61e5776, status: paused, memory: 6, processing: 1> (stimulus_id=‘handle-worker-cleanup-1703261028.5567412’)

2023-12-22 16:02:40,915 - distributed.scheduler - INFO - Starting worker compute stream, tls://10.0.0.37:36859

2023-12-22 16:02:40,915 - distributed.scheduler - INFO - Register worker <WorkerState ‘tls://10.0.0.37:36859’, name: mihir-gpu-worker-321887a9c1, status: running, memory: 0, processing: 0>

2023-12-22 16:02:39,280 - distributed.scheduler - INFO - Remove worker <WorkerState ‘tls://10.0.0.37:43865’, name: mihir-gpu-worker-321887a9c1, status: paused, memory: 6, processing: 1> (stimulus_id=‘handle-worker-cleanup-1703260959.2802076’)


According to the logs, this still looks like a memory problem: too much data is being loaded onto one Worker. We would need the logs of the workers that died to be sure.

Anyway, I would advise looking at what your data looks like, how it is chunked, and whether a single merge partition can fit into memory.
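
For example, a quick way to check partition counts and sizes before the merge (a minimal sketch with stand-in DataFrames; replace `df_left`/`df_right` and the key column `id` with your own data):

```python
import pandas as pd
import dask.dataframe as dd

# Small stand-in DataFrames; replace with your real data
df_left = dd.from_pandas(pd.DataFrame({"id": range(1_000_000), "x": 1.0}), npartitions=8)
df_right = dd.from_pandas(pd.DataFrame({"id": range(1_000_000), "y": 2.0}), npartitions=8)

# How many partitions each side has
print(df_left.npartitions, df_right.npartitions)

# Rows per partition -- very uneven counts point at skewed keys
print(df_left.map_partitions(len).compute())

# Approximate in-memory size of each partition, in bytes
print(df_left.map_partitions(lambda part: part.memory_usage(deep=True).sum()).compute())

# If partitions are too large, split them further before merging
df_left = df_left.repartition(partition_size="100MB")
result = df_left.merge(df_right, on="id")
print(result.head())
```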
