@jrbourbeau Where can I find the worker logs? I tried to print the result of client.get_worker_logs, but KilledWorker is not mentioned anywhere in it.
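In case it helps, here is roughly how I am dumping the logs (a minimal sketch assuming a fresh local Client; in my script the client already exists, and the "Killed"/"Restarting" filter is just my own grep, not part of the API):

```python
from distributed import Client

client = Client()  # in my script the client already exists

# get_worker_logs() returns a dict mapping each worker address to a list
# of (level, message) tuples, newest entries first
for worker, entries in client.get_worker_logs().items():
    for level, message in entries:
        if "Killed" in message or "Restarting" in message:
            print(worker, level, message)
```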
However, I think I know why the KilledWorker error occurs:
If I read this article correctly, the error occurs because the same task was tried on each of my 3 workers and they all died:
> Note the special case of KilledWorker: this means that a particular task was tried on a worker, and it died, and then the same task was sent to another worker, which also died. After a configurable number of deaths (config key distributed.scheduler.allowed-failures), Dask decides to blame the task itself, and returns this exception. Note, that it is possible for a task to be unfairly blamed - the worker happened to die while the task was active, perhaps due to another thread - complicating diagnosis.
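As a quick sanity check, I could raise that retry count before the cluster is created (a sketch; 10 is an arbitrary value, and the setting has to be in place before the scheduler starts):

```python
import dask
from distributed import Client, LocalCluster

# Allow each task to survive more worker deaths before Dask blames the
# task itself; the default is 3. 10 is an arbitrary test value.
dask.config.set({"distributed.scheduler.allowed-failures": 10})

cluster = LocalCluster()  # config is read when the scheduler starts
client = Client(cluster)
```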
And here is another output with the KilledWorker error:
```
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00005.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00006.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00002.pickle
2022-11-15 12:13:16,320 - distributed.nanny - WARNING - Restarting worker
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00002.pickle
2022-11-15 12:14:59,708 - distributed.nanny - WARNING - Restarting worker
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00005.pickle
2022-11-15 12:15:05,569 - distributed.nanny - WARNING - Restarting worker
Trying to suppress D:/data_dev/street_pointcloud_process/tempD:/data_dev/street_pointcloud_process/temp/temp__wam-ob00002.pickle but cannot found this file.
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00006.pickle
Trying to suppress D:/data_dev/street_pointcloud_process/tempD:/data_dev/street_pointcloud_process/temp/temp__wam-ob00006.pickle but cannot found this file.
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00007.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle
2022-11-15 12:18:21,899 - distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
File "D:\calba\pdal-parallelizer\src\pdal_parallelizer\__main__.py", line 142, in <module>
main()
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "D:\calba\pdal-parallelizer\src\pdal_parallelizer\__main__.py", line 128, in process_pipelines
compute_and_graph(client=client, tasks=delayed, output_dir=output, diagnostic=diagnostic)
File "D:\calba\pdal-parallelizer\src\pdal_parallelizer\__main__.py", line 51, in compute_and_graph
dask.compute(*tasks)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\dask\base.py", line 600, in compute
results = schedule(dsk, keys, **kwargs)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\client.py", line 3096, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\client.py", line 2265, in gather
return self.sync(
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\utils.py", line 339, in sync
return sync(
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\utils.py", line 406, in sync
raise exc.with_traceback(tb)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\utils.py", line 379, in f
result = yield future
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\tornado\gen.py", line 762, in run
value = future.result()
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\client.py", line 2128, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task process-60c5aa78-0ae2-42cb-aba6-33189defccc7 on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:50741. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
2022-11-15 12:19:59,478 - distributed.nanny - WARNING - Worker process still alive after 3.1999989318847657 seconds, killing
```
So apparently there is a problem with the pipeline stored in the file D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle: it is the only file printed three times, which matches the task dying on 3 different workers. I will continue to investigate.
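My next step is to load that pickle directly and run it outside Dask (a sketch; I am assuming the file unpickles into the pipeline object my tool serialized):

```python
import pickle

# Path of the suspect file, copied from the log above
path = "D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle"

with open(path, "rb") as f:
    pipeline = pickle.load(f)

print(type(pipeline), pipeline)
# If this turns out to be a PDAL pipeline, executing it here, outside
# Dask, should reproduce the crash in a single process where it is
# much easier to debug.
```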