Dask flow manager?

We have an architecture problem in our Pangeo Forge project that I would like to run by this community.

Some short context: Pangeo Forge is an ETL framework for multi-dimensional array data (think xarray as the core data model). At its core is a Python library, pangeo_forge_recipes, which works by building an abstract representation of an ETL pipeline (a “recipe”) and then “compiling” it to a range of different executor frameworks. We currently support Dask, Prefect, and Apache Beam as executors. We have no interest in building our own distributed processing engine.
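Roughly, the pattern looks like this (a simplified sketch; the class and method names such as `XarrayZarrRecipe`, `to_dask()`, and `to_prefect()` are my shorthand and may not match the current pangeo_forge_recipes API exactly):

```python
# Simplified sketch of the "compile a recipe to an executor" pattern.
# Names here are illustrative and may differ from the actual library API.
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# `file_pattern` is a pre-built FilePattern describing the source files
# (construction omitted for brevity).
recipe = XarrayZarrRecipe(file_pattern, target_chunks={"time": 10})

# The same abstract recipe can be "compiled" to different executor frameworks:
dask_delayed = recipe.to_dask()      # a dask.delayed object
prefect_flow = recipe.to_prefect()   # a prefect.Flow

# Running it is then up to the chosen executor, e.g.:
dask_delayed.compute()
```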

In addition to the Python package itself, we are also building what is essentially a cloud CI service that automatically deploys these pipelines from GitHub repos. We have been using Prefect + Prefect Cloud as the executor for this, because of its robust and well-documented deployment and orchestration capabilities. The key feature is the ability to register a flow, ship it off to a backend service (a “Bakery” in Pangeo Forge terminology), and monitor the status of the flow.
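For reference, that register-and-ship step with Prefect 1.x looks roughly like this (a hedged sketch; the flow and project names are placeholders):

```python
from prefect import Flow, task

@task
def run_recipe_stage():
    ...  # side-effecting ETL work for one stage of the recipe

with Flow("pangeo-forge-recipe") as flow:
    run_recipe_stage()

# Register the flow with Prefect Cloud; a Prefect agent running in the
# Bakery then picks it up and executes it, and Prefect Cloud tracks its status.
flow.register(project_name="pangeo-forge")
```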

However, Prefect can be a pretty heavy dependency, and it comes with lots of bells and whistles that we don’t always need. The Pangeo Forge Dask executor, on the other hand, generally works great, but it has to be launched manually.

We don’t really see how to serialize and ship a Dask graph out to a backend service and then query its status at a later time. (We don’t need the actual results of the computation; everything happens through side effects.) Basically, we need some kind of higher-level Dask flow manager / meta-scheduler, replacing the high-level Flow orchestration part of Prefect but without the need for task-level visibility.
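To make the desired interaction concrete, here is a rough sketch of the submitting side using plain dask.distributed primitives (fire_and_forget plus a published dataset); the scheduler address, dataset name, and `recipe_graph` object are placeholders, and this only works as long as the same scheduler stays up:

```python
# Submitting side (hedged sketch; addresses and names are placeholders).
from dask.distributed import Client, fire_and_forget

client = Client("tcp://bakery-scheduler:8786")   # long-running scheduler in the Bakery

# `recipe_graph` stands in for the dask.delayed object produced by the recipe compiler.
future = client.compute(recipe_graph)

# Keep the computation running even after this client disconnects,
# and register the future under a name so another client can find it later.
fire_and_forget(future)
client.publish_dataset(recipe_run=future)
```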

This seems like it must be a relatively common need in the Dask ecosystem. Any ideas?


@rabernat Welcome to the Dask Discourse, and thanks for creating this topic!

Jim Crist mentioned that you two had a chat about this and that you might share your notes here. That would be super helpful – a major reason for starting this Discourse forum was to create a searchable knowledge base, so thank you for adding to it! :sparkles:


Hi @rabernat. Reading this, I’d say you just need a Dask Client that can submit a computation, with that computation identified in such a way that another Client, or a Client created at a later time, can retrieve its status and result? A bit like the Futures API, but for entire graphs, and without needing to hold on to a Python object that points to the futures…
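If I understand the idea, the retrieval side would then look something like this (again a hedged sketch; the scheduler address and dataset name are assumed to match whatever was used at submission time):

```python
# Later, from a different process / client (hedged sketch).
from dask.distributed import Client

client = Client("tcp://bakery-scheduler:8786")

# Look up the future that was published when the graph was submitted.
future = client.get_dataset("recipe_run")

print(future.status)   # e.g. "pending", "finished", or "error"
```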
