Background on new graph specification

astrojuanlu · May 20, 2025, 5:44am

Hi all, I noticed that the “Specification” page was updated (dask/dask#11915), so that the “plain” representation of the graph using “ordinary Python data structures” is now legacy, and the new one uses DataNode, Task and List objects, and manages references explicitly with .ref() rather than implicitly.

Would love to know more about the background and rationale, and couldn’t find it. I notice these objects were introduced a while ago, but the documentation has only now been updated. Are these the result of the work that started with High Level Query Optimization in Dask? Is there any writeup on why the legacy “plain” dictionary representation of the graph wasn’t enough for what the Dask project wants to achieve? And one more thing, does the new spec fundamentally change how indirect dependencies are expressed?

(I put this in “Dask Bag” because I don’t think it neatly fits in the other categories, but feel free to move the post)

guillaumeeb · May 23, 2025, 2:05pm

Hi @astrojuanlu, welcome to Dask discourse!

Well, I guess you’ve read the new page, but I’ll still quote the explanation from the end of that page here:

The tuples are objectively a more compact representation than the Task class so why did we choose to introduce this new representation?

As a tuple, the task is not self-describing and is heavily context dependent. The meaning of a tuple like (func, "x", "y") depends on the graph it is embedded in. The literals x and y could be either actual literals that should be passed to the function or they could be references to other tasks. Therefore, interpreting this task requires walking the tuple recursively and comparing every single encountered element with the known keys in the graph. Especially for large graphs or deeply nested tuple arguments, this can be a performance bottleneck. For APIs that allow users to define their own key names, this can further cause false positives where intended literals are replaced by pre-computed task results.
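
To make the ambiguity described above concrete, here is a minimal sketch in plain Python. It is a toy model, not Dask's implementation: `legacy_execute`, `explicit_execute`, `Ref` and `concat` are hypothetical names invented for this illustration, and the explicit variant still uses tuples for tasks so that the only difference is how references are marked.

```python
# Toy illustration only: none of these helpers are part of Dask.

def legacy_execute(graph, key):
    """Evaluate a legacy-style graph whose tasks are tuples (func, *args)."""
    return _legacy_resolve(graph, graph[key])


def _legacy_resolve(graph, value):
    # A tuple whose first element is callable is treated as a task.
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        return func(*(_legacy_resolve(graph, a) for a in args))
    # Every other value has to be compared against the graph's keys, because
    # the tuple does not say whether an argument is a literal or a reference.
    try:
        if value in graph:
            return _legacy_resolve(graph, graph[value])
    except TypeError:  # unhashable values can never be keys
        pass
    return value


class Ref:
    """Hypothetical explicit marker: 'this argument refers to another key'."""

    def __init__(self, key):
        self.key = key


def explicit_execute(graph, key):
    """Evaluate a graph in which references are always wrapped in Ref."""
    return _explicit_resolve(graph, graph[key])


def _explicit_resolve(graph, value):
    if isinstance(value, Ref):
        return _explicit_resolve(graph, graph[value.key])
    if isinstance(value, tuple) and value and callable(value[0]):
        func, *args = value
        return func(*(_explicit_resolve(graph, a) for a in args))
    # Anything else is a literal; no lookup against the graph keys is needed.
    return value


def concat(a, b):
    return f"{a}-{b}"


# The author wants the literal string "x", but "x" also happens to be a key.
legacy_graph = {"x": 1, "result": (concat, "x", "y")}
print(legacy_execute(legacy_graph, "result"))  # 1-y (the literal was replaced)

explicit_graph = {
    "x": 1,
    "literal": (concat, "x", "y"),
    "referenced": (concat, Ref("x"), "y"),
}
print(explicit_execute(explicit_graph, "literal"))     # x-y
print(explicit_execute(explicit_graph, "referenced"))  # 1-y
```

In the legacy-style graph the string "x" is silently swapped for the value stored under the key "x", which is the false positive the quoted text mentions. With an explicit marker, the evaluator never has to guess, and it also skips the key lookup for every literal argument.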

I also think this was introduced as part of the work that started in dask-expr, that is, the query optimization effort.

I’m not sure about this question. This shouldn’t have any impact on the Dask APIs or on how you use Dask.

guillaumeeb · May 23, 2025, 2:51pm

cc @fjetter @Patrick

astrojuanlu · May 23, 2025, 3:21pm

Uh, how embarrassing, I missed that section every time. Thanks for highlighting it as well as the extra context!
