Dask Distributed Dataframe Persist (synchronous vs asynchronous scheduler)


I would like to better understand the behaviour of persist() in a distributed setting (on K8s), using the Dask Kubernetes Operator. I have around 30 fairly large workers and a scheduler (4x32) orchestrating the Dask cluster.

I have tried both APIs — Dask.distributed 2023.8.0+28.ga356fb8 documentation and dask.dataframe.Series.persist — Dask documentation. Both return without blocking the function call; in other words, they are asynchronous in this distributed setting.

Both of these do help subsequent transformations once the persist completes in full. However, we want more deterministic behaviour: a blocking/synchronous call that persists all the data, since we are working with over 1 TB and a full persist takes 30-40 minutes.

Being able to deterministically persist the data makes it easier to run the subsequent transformations and save the result to a Parquet file.


Hi @jerrygb,

I think this has been answered in this Stack Overflow question:

Just use dask.distributed.wait.