@jrbourbeau Where can I find the worker logs? I tried to print the result of client.get_worker_logs, but KilledWorker is not mentioned anywhere in it.
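In case it helps, here is roughly how I am dumping the logs (a minimal sketch assuming a fresh local Client; in my script the client already exists, and the "Killed"/"Restarting" filter is just my own grep, not part of the API):

```python
from distributed import Client

client = Client()  # in my script the client already exists

# get_worker_logs() returns a dict mapping each worker address to a list
# of (level, message) tuples, newest entries first
for worker, entries in client.get_worker_logs().items():
    for level, message in entries:
        if "Killed" in message or "Restarting" in message:
            print(worker, level, message)
```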
However, I think I know why the KilledWorker error occurs:
If I read this article correctly, the error occurs because the same task was tried on each of my 3 workers and they all died:
> Note the special case of KilledWorker: this means that a particular task was tried on a worker, and it died, and then the same task was sent to another worker, which also died. After a configurable number of deaths (config key distributed.scheduler.allowed-failures), Dask decides to blame the task itself, and returns this exception. Note, that it is possible for a task to be unfairly blamed - the worker happened to die while the task was active, perhaps due to another thread - complicating diagnosis.
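As a quick sanity check, I could raise that retry count before the cluster is created (a sketch; 10 is an arbitrary value, and the setting has to be in place before the scheduler starts):

```python
import dask
from distributed import Client, LocalCluster

# Allow each task to survive more worker deaths before Dask blames the
# task itself; the default is 3. 10 is an arbitrary test value.
dask.config.set({"distributed.scheduler.allowed-failures": 10})

cluster = LocalCluster()  # config is read when the scheduler starts
client = Client(cluster)
```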
And here is another output with the KilledWorker error:
```
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00005.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00006.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00002.pickle
2022-11-15 12:13:16,320 - distributed.nanny - WARNING - Restarting worker
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00002.pickle
2022-11-15 12:14:59,708 - distributed.nanny - WARNING - Restarting worker
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00005.pickle
2022-11-15 12:15:05,569 - distributed.nanny - WARNING - Restarting worker
Trying to suppress D:/data_dev/street_pointcloud_process/tempD:/data_dev/street_pointcloud_process/temp/temp__wam-ob00002.pickle but cannot found this file.
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00006.pickle
Trying to suppress D:/data_dev/street_pointcloud_process/tempD:/data_dev/street_pointcloud_process/temp/temp__wam-ob00006.pickle but cannot found this file.
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00007.pickle
D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle
2022-11-15 12:18:21,899 - distributed.nanny - WARNING - Restarting worker
Traceback (most recent call last):
File "D:\calba\pdal-parallelizer\src\pdal_parallelizer\__main__.py", line 142, in <module>
main()
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "D:\calba\pdal-parallelizer\src\pdal_parallelizer\__main__.py", line 128, in process_pipelines
compute_and_graph(client=client, tasks=delayed, output_dir=output, diagnostic=diagnostic)
File "D:\calba\pdal-parallelizer\src\pdal_parallelizer\__main__.py", line 51, in compute_and_graph
dask.compute(*tasks)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\dask\base.py", line 600, in compute
results = schedule(dsk, keys, **kwargs)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\client.py", line 3096, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\client.py", line 2265, in gather
return self.sync(
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\utils.py", line 339, in sync
return sync(
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\utils.py", line 406, in sync
raise exc.with_traceback(tb)
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\utils.py", line 379, in f
result = yield future
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\tornado\gen.py", line 762, in run
value = future.result()
File "C:\Users\calba\AppData\Roaming\Python\Python39\site-packages\distributed\client.py", line 2128, in _gather
raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task process-60c5aa78-0ae2-42cb-aba6-33189defccc7 on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:50741. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
2022-11-15 12:19:59,478 - distributed.nanny - WARNING - Worker process still alive after 3.1999989318847657 seconds, killing
```
So apparently there is a problem with the pipeline stored in the file D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle: it is the only file printed three times, which matches the task dying on 3 different workers. I will continue to investigate.
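My next step is to load that pickle directly and run it outside Dask (a sketch; I am assuming the file unpickles into the pipeline object my tool serialized):

```python
import pickle

# Path of the suspect file, copied from the log above
path = "D:/data_dev/street_pointcloud_process/temp/temp__wam-ob00004.pickle"

with open(path, "rb") as f:
    pipeline = pickle.load(f)

print(type(pipeline), pipeline)
# If this turns out to be a PDAL pipeline, executing it here, outside
# Dask, should reproduce the crash in a single process where it is
# much easier to debug.
```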