Background on new graph specification

Hi all, I noticed that the “Specification” page was updated (dask/dask#11915), so that the “plain” representation of the graph using “ordinary Python data structures” is now legacy, and the new one uses DataNode, Task and List objects, and manages references explicitly with .ref() rather than implicitly.

Would love to know more about the background and rationale, and couldn’t find it. I notice these objects were introduced a while ago, but the documentation has only now been updated. Are they the result of the work that started with High Level Query Optimization in Dask? Is there any writeup on why the legacy “plain” dictionary representation of the graph wasn’t enough for what the Dask project wants to achieve? And one more thing: does the new spec fundamentally change how indirect dependencies are expressed?

(I put this in “Dask Bag” because I don’t think it neatly fits in the other categories, but feel free to move the post)

Hi @astrojuanlu, welcome to Dask discourse!

Well, I guess you’ve read the new page, but I’ll still copy the explanation from the end of that page here:

The tuples are objectively a more compact representation than the Task class so why did we choose to introduce this new representation?

As a tuple, the task is not self-describing and heavily context-dependent. The meaning of a tuple like (func, "x", "y") depends on the graph it is embedded in. The literals x and y could be either actual literals that should be passed to the function or they could be references to other tasks. Therefore, the interpretation of this task has to walk the tuple recursively and compare every single encountered element with known keys in the graph. Especially for large graphs or deeply nested tuple arguments, this can be a performance bottleneck. For APIs that allow users to define their own key names this can further cause false positives where intended literals are replaced by pre-computed task results.
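To make the ambiguity concrete, here is a small toy sketch in plain Python. It is not Dask's implementation: `legacy_eval`, `explicit_eval`, `Ref` and `shout` are hypothetical names made up for illustration, and `Ref` is only a stand-in for the idea of an explicit reference. It shows how the same tuple changes meaning depending on which keys happen to exist in the graph, and how an explicit reference marker removes the guesswork.

```python
from dataclasses import dataclass


def shout(word, suffix):
    return word.upper() + suffix


def legacy_eval(graph, expr):
    """Toy evaluator for tuple-style tasks: it has to guess what is a key."""
    if isinstance(expr, tuple) and expr and callable(expr[0]):
        func, *args = expr
        return func(*(legacy_eval(graph, a) for a in args))
    # Any string that happens to match a key is treated as a reference.
    if isinstance(expr, str) and expr in graph:
        return legacy_eval(graph, graph[expr])
    return expr


@dataclass(frozen=True)
class Ref:
    """Hypothetical explicit reference to another key (illustration only)."""
    key: str


def explicit_eval(graph, expr):
    """Toy evaluator where only Ref objects are looked up in the graph."""
    if isinstance(expr, Ref):
        return explicit_eval(graph, graph[expr.key])
    if isinstance(expr, tuple) and expr and callable(expr[0]):
        func, *args = expr
        return func(*(explicit_eval(graph, a) for a in args))
    return expr


task = (shout, "x", "!")

# Same tuple, two graphs: the result depends on whether a key "x" exists.
without_key = legacy_eval({"out": task}, task)
with_key = legacy_eval({"x": "hello", "out": task}, task)

# With explicit references, a literal "x" stays a literal even if "x" is a key.
graph = {"x": "hello"}
literal = explicit_eval(graph, (shout, "x", "!"))
reference = explicit_eval(graph, (shout, Ref("x"), "!"))

print(without_key, with_key, literal, reference)
```

In the toy legacy evaluator, adding a key named "x" silently turns the intended literal into a lookup, which is the false-positive case the docs describe. The explicit variant also never has to compare every argument against the set of known keys.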

And I think this was also introduced as part of the work that started in dask-expr, i.e. the query optimization effort.

As for your last question about indirect dependencies, I’m not sure. Either way, this shouldn’t have any impact on the Dask APIs or on how you use Dask.

cc @fjetter @Patrick

Uh, how embarrassing, I missed that section every time. Thanks for highlighting it :folded_hands: as well as the extra context!