Transitioning dictionary low level graph from generator to HighLevelGraph Layer

Currently dask.array.histogramdd uses a dictionary-style low-level graph. It’s built in the style of a generator (using a dictionary comprehension). A piece of the implementation can be seen here: dask/routines.py at 2021.11.1 · dask/dask · GitHub
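For anyone not familiar with the pattern, here is a minimal sketch of a low-level graph built with a dictionary comprehension. This is not the actual histogramdd code: block_hist, the key names, and the toy evaluate function (standing in for a scheduler) are all made up for illustration, and the example does not import dask at all.

```python
def block_hist(chunk, nbins, lo, hi):
    """Histogram one chunk (a list of floats) into nbins equal-width bins."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for x in chunk:
        if lo <= x < hi:
            counts[int((x - lo) / width)] += 1
    return counts


def evaluate(dsk, key):
    """Toy stand-in for a scheduler: recursively resolve one key."""
    task = dsk[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that are themselves keys in the graph get computed first.
        resolved = [
            evaluate(dsk, a) if isinstance(a, tuple) and a in dsk else a
            for a in args
        ]
        return func(*resolved)
    return task  # plain data


chunks = [[0.1, 0.4, 0.9], [0.2, 0.6], [0.55, 0.95]]

# Input "layer": one key per chunk of data.
graph = {("x", i): chunk for i, chunk in enumerate(chunks)}

# Output tasks, generated eagerly by a dictionary comprehension:
# every key -> (callable, *args) entry exists as soon as this line runs.
graph.update(
    {("hist", i): (block_hist, ("x", i), 4, 0.0, 1.0) for i in range(len(chunks))}
)

per_block = [evaluate(graph, ("hist", i)) for i in range(len(chunks))]
total = [sum(col) for col in zip(*per_block)]
```

The point is that the whole dictionary is materialized up front, which is exactly what a HighLevelGraph layer tries to postpone.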

I started trying to convert this to Blockwise usage, but ran into problems figuring out exactly how to define the indices when creating a layer with the existing dask.blockwise.blockwise and dask.array.blockwise functions. (Basically, the indices of the input data are unrelated to the indices of the output array.)

At this point I’m thinking about trying to create my own dask.highlevelgraph.Layer.

I’m wondering if anyone with some knowledge in this space can provide (if it exists) any kind of prescription for translating a known low-level dictionary graph into a highlevelgraph.Layer implementation.

Hi @ddavis, I don’t have a strong grasp on the histogram algorithm you linked, but it sounds correct to me that Blockwise might not work well if the output indices aren’t relatable to the input indices via tensor-style operations. I think @rjzamora might be able to comment more here on whether there is any way to make that work.

I do have a few thoughts on converting low-level graphs to high-level graphs (with both successful and unsuccessful attempts). Most prominently: avoid accidentally materializing the graph!

The process of culling an HLG involves going up the stack of layers and having each layer cull itself based on the keys that its dependents need. So I would recommend carefully looking at the implementation of HighLevelGraph.cull to see which Layer methods are called. When you are implementing your layer, these are the ones that should especially avoid materialization.

In particular: the default implementations for keys and get_output_keys() will materialize the graph, so one should take special care to override them in a way that is cheap to compute.
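To make that concrete, here is a rough sketch of the idea, not a working dask layer. LazyHistLayer is a hypothetical class that subclasses collections.abc.Mapping rather than dask.highlevelgraph.Layer, and its method names only loosely mirror the ones mentioned above; the real Layer signatures should be checked against your installed dask version. It shows output keys and culling being computed from metadata alone, with the task dictionary built only when someone actually indexes or iterates the layer.

```python
from collections.abc import Mapping


class LazyHistLayer(Mapping):
    """Hypothetical lazy layer: one output block per input block."""

    def __init__(self, name, input_name, nblocks, func, keep=None):
        self.name = name
        self.input_name = input_name
        self.nblocks = nblocks
        self.func = func
        # Block indices that survive culling (all of them by default).
        self.keep = set(range(nblocks)) if keep is None else set(keep)
        self._cached = None

    def get_output_keys(self):
        # Cheap: derived from metadata, no task dictionary is built.
        return {(self.name, i) for i in self.keep}

    def cull(self, keys):
        # Keep only the requested blocks; still no materialization.
        wanted = {k[1] for k in keys if k[0] == self.name} & self.keep
        culled = LazyHistLayer(
            self.name, self.input_name, self.nblocks, self.func, keep=wanted
        )
        deps = {(self.name, i): {(self.input_name, i)} for i in wanted}
        return culled, deps

    def is_materialized(self):
        return self._cached is not None

    def _dict(self):
        # The expensive step: build the low-level dictionary on demand.
        if self._cached is None:
            self._cached = {
                (self.name, i): (self.func, (self.input_name, i))
                for i in sorted(self.keep)
            }
        return self._cached

    # Mapping interface: every one of these forces materialization.
    def __getitem__(self, key):
        return self._dict()[key]

    def __iter__(self):
        return iter(self._dict())

    def __len__(self):
        return len(self._dict())


layer = LazyHistLayer("hist", "x", nblocks=1000, func=sum)
output_keys = layer.get_output_keys()
culled, deps = layer.cull({("hist", 3), ("hist", 7)})
still_lazy = not layer.is_materialized() and not culled.is_materialized()
```

In this sketch, anything that goes through the Mapping interface (indexing, iteration, len) builds the dictionary, which is why the cheap overrides matter.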

Of course, all of this should be better documented. Perhaps if we can collect some best practices here, it will allow us to update the official docs.