Distributed runner for Beam

Folks in the Pangeo community have been experimenting with the Apache Beam programming model for distributed computing. There is growing interested in implementing a Dask Distributed Runner for Beam (e.g. Support for a Dask runner · Issue #18962 · apache/beam · GitHub). This would make it a lot easier for people with existing Dask infrastructure to use Beam. Beam can already run on many different distributed computing backends (Spark, Flink, Dataflow, etc.) There is an extensive Beam Runner Authoring Guide (Google it, Discourse won’t let me post more than 2 links). So implementing a Dask runner is a tightly scoped and completely feasible engineering task.

In order to kickstart the discussion of implementing a Dask Beam runner, I propose we meet during the week of June 13-17. I have created a When2Meet Poll here - Dask Beam Runner Discussion - When2meet . If you are interested in attending, please give your availability. Hope to see you at the meeting!

1 Like

It would be great if we could convince a Dask maintainer to attend this meeting. Optimistically tagging @gjoseph92 and @jrbourbeau.

Thanks to all who replied! We have scheduled the call for Wed June 15, 1:30 pm ET. The zoom link is Launch Meeting - Zoom

Looking forward to the discussion!

1 Like

For folks funding this thread via Google search I thought I would share some progress on the Beam Dask Runner.

The initial Dask Runner was implemented in Beam in Initial DaskRunner for Beam by alxmrs · Pull Request #22421 · apache/beam · GitHub; the code lives here:

Here’s a PyData talk from Alex Merose about the work

The Dask runner remains relatively immature and untested; there are important beam features that are still unimplemented (see open Beam issues.

We (Pangeo Forge project) are optimistic about the prospects of running Beam pipelines on Dask and would love to see more development happen in this area.

Sharing Charles Stern’s demo from Dask demo day earlier today: https://www.youtube.com/watch?v=wkQzVNQdgW0&t=48s