Debugging resource annotation issues

I use resource annotations (see the Worker Resources page in the Dask.distributed documentation) to ensure that several GPU tasks are not executed simultaneously by different threads, since that can lead to OOMs.
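For context, a minimal sketch of the kind of setup I mean (the scheduler address, the resource name, and run_on_gpu are placeholders, not my actual pipeline):

import dask
import dask.array as da
from distributed import Client

# Assumes a worker was started with a resource declared, e.g.:
#   dask worker scheduler-address:8786 --resources "GPU=1"
client = Client("scheduler-address:8786")

def run_on_gpu(block):
    # Stand-in for the real GPU kernel (hypothetical).
    return block * 2

x = da.random.random((8_000, 8_000), chunks=(1_000, 1_000))

# Tasks created inside dask.annotate carry the resource annotation, so the
# scheduler should run at most one of them per GPU resource unit at a time.
with dask.annotate(resources={"GPU": 1}):
    y = x.map_blocks(run_on_gpu)

y.compute(optimize_graph=False)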

However, this mechanism is turning out to be brittle: minor, unrelated changes to the pipeline lead to simultaneous execution.

From the documentation:

In most cases (such as the case above) the annotations for y may be lost during graph optimization before execution. You can avoid that by passing the optimize_graph=False keyword.

I do understand that resource annotations need not be honored by the scheduler, but I see this behavior even when optimize_graph=False is used.

Would it be possible to debug, at the scheduler level, how the annotations are being used and whether they are being honored?



Hm, this looks like a complex problem; it is probably not normal that resource annotations are not honored with optimize_graph=False.

I'm not an expert here, but I can think of two ways of trying to understand why:

  • Dig into the task graph or the HighLevelGraph to see whether the annotations are taken into account (see the first sketch below).
  • Build a SchedulerPlugin and use its transition hook to inspect each task's annotations (see the second sketch below).
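A minimal sketch of the first idea: dask.annotate attaches annotations to the layers of a collection's HighLevelGraph, so you can print them client-side before computing (assuming your collection still produces a HighLevelGraph):

import dask
import dask.array as da

with dask.annotate(resources={"GPU": 1}):
    x = da.ones((10, 10), chunks=(5, 5))

# Each HighLevelGraph layer carries its own annotations dict; check that the
# resource annotation is still attached before you compute.
hlg = x.__dask_graph__()
for name, layer in hlg.layers.items():
    print(name, layer.annotations)

And a sketch of the second idea, a SchedulerPlugin whose transition hook logs what the scheduler holds for each task (untested; the class and logger names are mine):

import logging
from distributed.diagnostics.plugin import SchedulerPlugin

logger = logging.getLogger("annotation-debug")

class AnnotationLogger(SchedulerPlugin):
    async def start(self, scheduler):
        # Keep a handle on the scheduler to look up TaskState objects later.
        self.scheduler = scheduler

    def transition(self, key, start, finish, *args, **kwargs):
        ts = self.scheduler.tasks.get(key)
        if ts is not None:
            logger.info("%s: %s -> %s annotations=%s",
                        key, start, finish, ts.annotations)

# Register it on a running cluster:
# client.register_scheduler_plugin(AnnotationLogger())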

Hope someone else can chime in to give some better advice.

As an alternative to optimize_graph=False, you should be able to just write

dask.config.set({"optimization.fuse.active": False})

which turns off only low-level task fusion (the optimization most likely to drop annotations) and leaves the other optimizations on.
If you are still losing annotations after this, please open a ticket with a reproducer.
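If you'd rather not change the global config, dask.config.set also works as a context manager, so you can scope the change to a single compute call (a sketch; y is the annotated collection from above):

import dask

with dask.config.set({"optimization.fuse.active": False}):
    result = y.compute()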

To debug this yourself:

import asyncio

x = ...  # build dask collection
future = client.compute(x)
key = future.key

async def get_annotations(dask_scheduler, key):
    # Wait until the task has actually arrived at the scheduler.
    while key not in dask_scheduler.tasks:
        await asyncio.sleep(0.01)
    # Return the annotations the scheduler holds for every task it knows about.
    return {ts.key: ts.annotations for ts in dask_scheduler.tasks.values()}

# Pass key explicitly rather than relying on closure pickling.
annotations = client.run_on_scheduler(get_annotations, key=key)
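If the annotations show up on the HighLevelGraph layers client-side but come back empty from the scheduler here, they were lost somewhere in between, and that is exactly the kind of reproducer worth attaching to a ticket.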