We have an architecture problem in our Pangeo Forge project that I would like to run by this community.
Some short context: Pangeo Forge is an ETL framework for multi-dimensional array data (think xarray as the core data model). At its core is a Python library, pangeo_forge_recipes, which works by building an abstract representation of an ETL pipeline (a “recipe”) and then “compiling” it to a range of different executor frameworks. We currently support Dask, Prefect, and Apache Beam as executors. We have no interest in building our own distributed processing engine.
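Schematically, the user-facing side of that compilation step looks something like this (recipe construction details are elided, and the exact method names should be read as illustrative of the model rather than as a definitive API):

```python
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# A recipe is an abstract description of the pipeline...
recipe = XarrayZarrRecipe(...)  # construction details elided

# ...which can then be "compiled" for any supported executor:
delayed = recipe.to_dask()      # a dask Delayed object
flow = recipe.to_prefect()      # a prefect.Flow
ptransform = recipe.to_beam()   # an apache_beam PTransform
```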
In addition to the Python package itself, we are also building what is essentially a cloud CI service that automatically deploys these pipelines from GitHub repos. We have been using Prefect + Prefect Cloud as the executor for this, because of its robust and well-documented deployment and orchestration capabilities. The key feature is the ability to register a flow, ship it off to a backend service (“Bakery” in Pangeo Forge terminology), and monitor the status of the flow.
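Concretely, the Prefect (1.x) pattern we lean on is roughly the following (the project name is made up, but these are standard prefect.Client calls):

```python
from prefect import Client

flow = recipe.to_prefect()
flow_id = flow.register(project_name="pangeo-forge")  # ship the flow to the backend

client = Client()
flow_run_id = client.create_flow_run(flow_id=flow_id)  # launch it remotely

# ...later, possibly from a completely different process, poll the run:
state = client.get_flow_run_info(flow_run_id).state
print(state.is_finished(), state.is_successful())
```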
However, Prefect can be a pretty heavy dependency, and it comes with lots of bells and whistles that we don’t always need. The Pangeo Forge Dask executor, on the other hand, generally works great, but it has to be launched manually.
We don’t really see how to serialize a Dask graph, ship it out to a backend service, and then query its status at a later time. (We don’t need the actual results of the computation; everything happens through side effects.) Basically, we need some kind of higher-level Dask flow manager / meta-scheduler, replacing the high-level Flow orchestration part of Prefect but without the need for task-level visibility.
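To make the requirement concrete, here is very roughly the usage we are after, sketched with plain dask.distributed primitives. The scheduler address and run name are made up, and whether a pattern like this (e.g. publishing the future as a named dataset so it can be looked up later) is actually robust enough is exactly what we are unsure about:

```python
from distributed import Client, fire_and_forget

client = Client("tcp://bakery-scheduler:8786")  # hypothetical Bakery scheduler
future = client.compute(delayed)                # the compiled Dask graph

# Keep a named reference alive on the scheduler so the work outlives
# this client, then disconnect.
client.publish_dataset(future, name="recipe-run-42")
fire_and_forget(future)
client.close()

# ...later, from another process, reconnect and ask how it went:
client = Client("tcp://bakery-scheduler:8786")
future = client.get_dataset("recipe-run-42")
print(future.status)  # e.g. "pending", "finished", or "error"
```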
This seems like it must be a relatively common need in the Dask ecosystem. Any ideas?