Provenance tracking with Dask

Hi!

I am looking into different ways of building data analysis workflows for neutron scattering. There are a number of requirements, and it currently looks like some kind of graph-based implementation would be well suited. Dask’s task graphs might be a good fit, in particular via the delayed interface, possibly in combination with bags and manually constructed graphs.
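
To make that concrete, here is a minimal sketch of the kind of workflow I have in mind (the functions and file name are made up for illustration):

```python
import dask

@dask.delayed
def load(filename):
    # Stand-in for loading a measurement file.
    return {"file": filename, "counts": [1.0, 2.0, 3.0]}

@dask.delayed
def subtract_background(data, scale):
    return {**data, "counts": [c - scale for c in data["counts"]]}

result = subtract_background(load("experiment_42.nxs"), 0.1)

# The task graph is what could double as a provenance record:
print(dict(result.__dask_graph__()))
print(result.compute())
```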

One requirement is to track provenance of the analysed data. A task-graph could in principle serve as a provenance record. However, there are some complications.

  • Dask’s graphs often contain implementation details, such as nodes for handling chunking, that are irrelevant for provenance (see the sketch after this list).
  • It can be tricky to identify parameters set by a user.
  • All history is lost when a result is computed. This is especially bad when intermediate results need to be computed, since that splits the workflow into disconnected graphs.
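
To illustrate the first point, even a trivial chunked reduction produces one key per chunk plus aggregation keys (exact key names depend on the Dask version):

```python
import dask.array as da

x = da.ones((10, 10), chunks=(5, 5)).sum()

# Per-chunk and aggregation keys that are pure implementation
# detail from a provenance point of view:
for key in x.__dask_graph__():
    print(key)
```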

Has anyone looked into using Dask for provenance tracking? Are there maybe even packages out there that do it?

(Currently, performance and running on multiple nodes are not concerns. But this may change in the future as detectors grow and collect more data.)

Hi @Jankas, welcome to this forum!

Well, I’m sorry to say I haven’t got a lot to answer here. I’m not familiar with provenance tracking at all.

Maybe the HighLevelGraph representation would be better for your needs?
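
Just a quick sketch of what I mean (layer contents and names depend on the Dask version):

```python
import dask.array as da

x = da.ones((10, 10), chunks=(5, 5)).sum()
hlg = x.__dask_graph__()  # a HighLevelGraph

# One layer per collection-level operation; the per-chunk keys
# stay hidden inside each layer.
print(list(hlg.layers))
print(hlg.dependencies)
```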

Could you explain a bit more about what you would want to identify?

I don’t see what can be done here; Dask doesn’t want to keep intermediate results in memory…

Thanks @guillaumeeb!

This might indeed help in some cases. I’d have to try it out on a concrete example, but I’m not at that point yet. In any case, this is a minor point.

I would like to keep track of the inputs to a workflow. For example, given a result, I would like to see the file name of an input or the value of a specific parameter. As far as I can tell, I can find those by walking the task graph: every function argument that is not a string is an input, as is every string that is not used as a key.
So this is a solvable problem. But I was wondering if there is a more straightforward way to distinguish inputs from intermediate results.
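
For the record, here is a rough sketch of that walk. It assumes the classic tuple-based task spec (newer Dask releases represent tasks as dask._task_spec.Task objects instead) and ignores nested and keyword arguments:

```python
import dask
from dask.core import istask

@dask.delayed
def load(filename):
    return [1.0, 2.0, 3.0]

@dask.delayed
def scale(data, factor):
    return [x * factor for x in data]

result = scale(load("run_1234.nxs"), 2.0)
dsk = dict(result.__dask_graph__())

def literal_inputs(dsk):
    """Yield (key, argument) pairs where the argument is not the
    key of another task, i.e. a user-supplied input."""
    keys = set(dsk)
    for key, task in dsk.items():
        if not istask(task):
            continue  # the key maps to a literal, not a computation
        for arg in task[1:]:
            if not (isinstance(arg, str) and arg in keys):
                yield key, arg

for key, arg in literal_inputs(dsk):
    print(key, "<-", arg)
```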

This is not directly about intermediate results. It is more about tracking what computations were done to obtain a result.
We need a certain amount of interactivity, so it does not seem feasible to encode the entire workflow as a single graph. Instead there would be separate subgraphs, say a and b, where we compute a result of a and feed it as an input to b. But this step loses the history of the intermediate result.
This can be solved by storing a’s graph in the intermediate result along with the data and gluing it onto b at the end.
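
A rough sketch of that gluing, with the same caveat about the classic tuple task spec; matching the baked-in literal by value is fragile and only for illustration:

```python
import dask

@dask.delayed
def stage_a(filename):
    return [1.0, 2.0, 3.0]

@dask.delayed
def stage_b(data, factor):
    return [x * factor for x in data]

a = stage_a("run_1234.nxs")
graph_a = dict(a.__dask_graph__())  # store this alongside the data
intermediate = a.compute()          # the history is lost here...

b = stage_b(intermediate, 2.0)
graph_b = dict(b.__dask_graph__())

# ...so glue the stored subgraph back on: swap the baked-in
# literal for a reference to a's output key, then merge.
task = graph_b[b.key]
patched = tuple(a.key if arg == intermediate else arg for arg in task)
full_graph = {**graph_a, **graph_b, b.key: patched}

# The merged graph reproduces the result with the full history:
print(dask.get(full_graph, b.key))
```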

So basically, I think this is doable overall. But it requires a lot of work. I was hoping that I’m not the first to try it with Dask and could leverage some existing codebase or insights.
