When running dask.compute(list_of_embarrassingly_parallel_delayed_objects) and an error traceback mentions one of the tasks, for example function-8353c4b7-954e-4cd6-be97-edd053e742ea, what information can I retrieve with that string?
If a computation is started with dask.compute((function(a,b) for a,b in zip(range(100), range(100)))), and an error mentions the key of the task that failed, I am ideally looking for a way to retrieve a and b.
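To make the goal concrete, here is a minimal sketch of the kind of lookup I have in mind. It assumes the tasks are built with dask.delayed, so that each object exposes its key through the .key attribute. The function body and the names args_by_key, named_tasks and failed_key are placeholders of mine, not part of the real workload. The second half uses the dask_key_name keyword to give each task a readable key, which might make the error message self-explanatory.

```python
from dask import delayed


def function(a, b):
    # Placeholder body; the real function is whatever the workload runs.
    return a + b


# Option 1: record the arguments under each task key while building the graph.
tasks = []
args_by_key = {}
for a, b in zip(range(100), range(100)):
    task = delayed(function)(a, b)
    tasks.append(task)
    args_by_key[task.key] = (a, b)  # key looks like "function-<uuid>"

# Later, a key copied from an error message can be looked up directly.
failed_key = tasks[42].key  # stand-in for the key reported in the traceback
print(failed_key, "->", args_by_key[failed_key])

# Option 2: choose the key yourself so it already encodes the arguments.
named_tasks = [
    delayed(function)(a, b, dask_key_name=f"function-{a}-{b}")
    for a, b in zip(range(100), range(100))
]
print(named_tasks[3].key)
```

Option 1 only helps if the mapping is built before the computation is submitted. It does not recover arguments from a key after the fact.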
Example of a traceback mentioning the failed task:
---------------------------------------------------------------------------
KilledWorker Traceback (most recent call last)
File <timed exec>:2
File /srv/conda/envs/notebook/lib/python3.10/site-packages/dask/base.py:600, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
597 keys.append(x.__dask_keys__())
598 postcomputes.append(x.__dask_postcompute__())
--> 600 results = schedule(dsk, keys, **kwargs)
601 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:3122, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
3120 should_rejoin = False
3121 try:
-> 3122 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
3123 finally:
3124 for f in futures.values():
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:2291, in Client.gather(self, futures, errors, direct, asynchronous)
2289 else:
2290 local_worker = None
-> 2291 return self.sync(
2292 self._gather,
2293 futures,
2294 errors=errors,
2295 direct=direct,
2296 local_worker=local_worker,
2297 asynchronous=asynchronous,
2298 )
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
337 return future
338 else:
--> 339 return sync(
340 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
341 )
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
404 if error:
405 typ, exc, tb = error
--> 406 raise exc.with_traceback(tb)
407 else:
408 return result
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:379, in sync.<locals>.f()
377 future = asyncio.wait_for(future, callback_timeout)
378 future = asyncio.ensure_future(future)
--> 379 result = yield future
380 except Exception:
381 error = sys.exc_info()
File /srv/conda/envs/notebook/lib/python3.10/site-packages/tornado/gen.py:769, in Runner.run(self)
766 exc_info = None
768 try:
--> 769 value = future.result()
770 except Exception:
771 exc_info = sys.exc_info()
File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:2154, in Client._gather(self, futures, errors, direct, local_worker)
2152 exc = CancelledError(key)
2153 else:
-> 2154 raise exception.with_traceback(traceback)
2155 raise exc
2156 if errors == "skip":
KilledWorker: Attempted to run task function-8353c4b7-954e-4cd6-be97-edd053e742ea on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tls://10.8.58.5:41111. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.