What information can I retrieve from a failed delayed object's name/key?

When running dask.compute(list_of_embarrassingly_parallel_delayed_objects) and the error traceback mentions one of the tasks, for example function-8353c4b7-954e-4cd6-be97-edd053e742ea, what information can I retrieve with that string?

If a computation is started with dask.compute((function(a,b) for a,b in zip(range(100), range(100)))), and an error mentions the key of a task that failed, I am ideally looking for a way to retrieve a and b.


Example of a traceback mentioning the failed task:
---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
File <timed exec>:2

File /srv/conda/envs/notebook/lib/python3.10/site-packages/dask/base.py:600, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    597     keys.append(x.__dask_keys__())
    598     postcomputes.append(x.__dask_postcompute__())
--> 600 results = schedule(dsk, keys, **kwargs)
    601 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:3122, in Client.get(self, dsk, keys, workers, allow_other_workers, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   3120         should_rejoin = False
   3121 try:
-> 3122     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   3123 finally:
   3124     for f in futures.values():

File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:2291, in Client.gather(self, futures, errors, direct, asynchronous)
   2289 else:
   2290     local_worker = None
-> 2291 return self.sync(
   2292     self._gather,
   2293     futures,
   2294     errors=errors,
   2295     direct=direct,
   2296     local_worker=local_worker,
   2297     asynchronous=asynchronous,
   2298 )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:339, in SyncMethodMixin.sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    337     return future
    338 else:
--> 339     return sync(
    340         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    341     )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:406, in sync(loop, func, callback_timeout, *args, **kwargs)
    404 if error:
    405     typ, exc, tb = error
--> 406     raise exc.with_traceback(tb)
    407 else:
    408     return result

File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/utils.py:379, in sync.<locals>.f()
    377         future = asyncio.wait_for(future, callback_timeout)
    378     future = asyncio.ensure_future(future)
--> 379     result = yield future
    380 except Exception:
    381     error = sys.exc_info()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/tornado/gen.py:769, in Runner.run(self)
    766 exc_info = None
    768 try:
--> 769     value = future.result()
    770 except Exception:
    771     exc_info = sys.exc_info()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/distributed/client.py:2154, in Client._gather(self, futures, errors, direct, local_worker)
   2152         exc = CancelledError(key)
   2153     else:
-> 2154         raise exception.with_traceback(traceback)
   2155     raise exc
   2156 if errors == "skip":

KilledWorker: Attempted to run task function-8353c4b7-954e-4cd6-be97-edd053e742ea on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tls://10.8.58.5:41111. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

Let’s look at some code:

from dask import delayed
from distributed import Client

import time

def sleeping_func(a,b):
    time.sleep(1)
    return a,b

delayed_calls = [delayed(sleeping_func)(a,b) for a,b in zip(range(100), range(100))]
delayed_calls[0].key

outputs:

'sleeping_func-48c4f535-feb8-40f6-8580-a0a7d9be4061'

So if you store the delayed objects, you’ll be able to see their keys. Then either you know from the position in the list what the arguments to the call were, or you store them on your side with a dict or something similar. The delayed object does not expose the arguments directly.
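To illustrate the dict option, here is a minimal sketch that records the arguments under each delayed object's key at creation time. The names args_by_key and failed_key are made up for this example, and failed_key is only a stand-in for a key copied out of an error message such as the KilledWorker one above.

```python
from dask import delayed


def sleeping_func(a, b):
    # Simplified version of the function above (no sleep needed here).
    return a, b


# Hypothetical bookkeeping: map each task key to the arguments it was built with.
args_by_key = {}
delayed_calls = []
for a, b in zip(range(100), range(100)):
    d = delayed(sleeping_func)(a, b)
    delayed_calls.append(d)
    args_by_key[d.key] = (a, b)

# Stand-in for a key copied from an error message.
failed_key = delayed_calls[42].key

print(args_by_key[failed_key])  # (42, 42)
```

The dict has to be built before calling dask.compute, since the key is all the error message gives you afterwards.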


Oh, this is great. I will be able to know from the position in the list what were the arguments in my case. So I will store the delayed objects onward, and will be able to do a lookup in the list whenever this happens next time. Thanks!
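For reference, the list-position lookup could look roughly like the sketch below. It assumes the inputs are kept in a list in the same order as the delayed objects; inputs, failed_key and position are names chosen for this example, and failed_key again stands in for the key reported in the error.

```python
from dask import delayed


def sleeping_func(a, b):
    return a, b


# Keep the inputs in the same order as the delayed objects built from them.
inputs = list(zip(range(100), range(100)))
delayed_calls = [delayed(sleeping_func)(a, b) for a, b in inputs]

# Stand-in for a key copied from an error message.
failed_key = delayed_calls[7].key

# Find the position of the failed task, then read the matching inputs.
keys = [d.key for d in delayed_calls]
position = keys.index(failed_key)
a, b = inputs[position]

print(position, a, b)  # 7 7 7
```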
