Distributed runner for Beam

rabernat · June 7, 2022, 12:59pm

Folks in the Pangeo community have been experimenting with the Apache Beam programming model for distributed computing. There is growing interest in implementing a Dask Distributed Runner for Beam (e.g. Support for a Dask runner · Issue #18962 · apache/beam · GitHub). This would make it a lot easier for people with existing Dask infrastructure to use Beam. Beam can already run on many different distributed computing backends (Spark, Flink, Dataflow, etc.). There is an extensive Beam Runner Authoring Guide (Google it, Discourse won’t let me post more than 2 links). So implementing a Dask runner is a tightly scoped and completely feasible engineering task.

To kickstart the discussion of implementing a Dask Beam runner, I propose we meet during the week of June 13-17. I have created a When2Meet poll here: Dask Beam Runner Discussion - When2meet. If you are interested in attending, please give your availability. Hope to see you at the meeting!

rabernat · June 8, 2022, 2:14pm

It would be great if we could convince a Dask maintainer to attend this meeting. Optimistically tagging @gjoseph92 and @jrbourbeau.

rabernat · June 13, 2022, 3:52pm

Thanks to all who replied! We have scheduled the call for Wed June 15, 1:30 pm ET. The zoom link is Launch Meeting - Zoom

Looking forward to the discussion!

rabernat · October 20, 2023, 7:01pm

For folks finding this thread via Google search, I thought I would share some progress on the Beam Dask Runner.

The initial Dask Runner was implemented in Beam in the pull request “Initial DaskRunner for Beam” by alxmrs (Pull Request #22421 · apache/beam · GitHub); the code lives here:

Here’s a PyData talk from Alex Merose about the work

The Dask runner remains relatively immature and untested; there are important Beam features that are still unimplemented (see open Beam issues).
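For anyone who wants to try it, here is a minimal sketch of pointing a small Beam pipeline at the Dask runner. It assumes Beam was installed with its optional Dask extra (`pip install "apache-beam[dask]"`) and that `DaskRunner` is importable from `apache_beam.runners.dask.dask_runner`; the helper names `word_length` and `run_on_dask` are made up for illustration. Given the runner's current state, treat this as a starting point rather than a guaranteed-working recipe.

```python
def word_length(word):
    """Pure per-element function; Beam ships it to the workers."""
    return word, len(word)


def run_on_dask(words):
    """Build a tiny Beam pipeline and execute it with the Dask runner.

    Hypothetical helper for illustration, not part of Beam or Dask.
    """
    # Imported lazily so this module still loads where the optional
    # Dask extra is not installed (assumption: apache-beam[dask]).
    import apache_beam as beam
    from apache_beam.runners.dask.dask_runner import DaskRunner

    # Assumption: with no extra options the runner creates its own
    # local Dask client instead of connecting to an existing cluster.
    with beam.Pipeline(runner=DaskRunner()) as pipeline:
        (
            pipeline
            | "CreateWords" >> beam.Create(words)
            | "Measure" >> beam.Map(word_length)
            | "Print" >> beam.Map(print)
        )


# Example call (needs apache-beam[dask] installed):
# run_on_dask(["pangeo", "beam", "dask"])
```

Only the `runner=` argument is specific to Dask; the same pipeline should run unchanged on Beam's default local runner if that argument is dropped, which is a handy way to check whether a failure comes from the pipeline or from the runner.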

We (Pangeo Forge project) are optimistic about the prospects of running Beam pipelines on Dask and would love to see more development happen in this area.

scharlottej13 · January 19, 2024, 12:20am

Sharing Charles Stern’s demo from Dask demo day earlier today: https://www.youtube.com/watch?v=wkQzVNQdgW0&t=48s
