Dask Distributed Dataframe Persist (synchronous vs asynchronous scheduler)

Hi,

I would like to better understand the behaviour of persist() in a distributed setting (on K8S). I am using the Dask Kubernetes Operator, with around 30 fairly large workers and a scheduler (4x32) orchestrating the Dask cluster.

I have tried both APIs: Client.persist (API — Dask.distributed 2023.8.0+28.ga356fb8 documentation) and dask.dataframe.Series.persist (dask.dataframe.Series.persist — Dask documentation). Both return without blocking the function call; in other words, they are asynchronous in nature in this distributed setting.

Both of these do help subsequent transformations once they complete in full. However, we want more deterministic behaviour by making a blocking/synchronous call that persists all the data, since we are working with over 1 TB of data and it takes 30-40 minutes to persist entirely.

Being able to deterministically persist the data would make it easier to continue with the subsequent transformations and save the result to a Parquet file.
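
For context, our workflow looks roughly like this (a minimal sketch; the scheduler address, paths, and the filter column are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# ~1 TB of input data
ddf = dd.read_parquet("s3://bucket/input/")  # placeholder path

# Returns immediately; the cluster keeps materialising partitions
# in the background
ddf = ddf.persist()

# Downstream transformations and the final write are built on top of the
# still-running persisted partitions
ddf = ddf[ddf["value"] > 0]  # placeholder transformation
ddf.to_parquet("s3://bucket/output/")  # placeholder path
```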

Thanks

Hi @jerrygb,

I think this has been answered in this Stack Overflow question:

Just use dask.distributed.wait.
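
Something like the following should give you the blocking behaviour you are after (a minimal sketch; the scheduler address and paths are placeholders):

```python
import dask.dataframe as dd
from dask.distributed import Client, wait

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

ddf = dd.read_parquet("s3://bucket/input/")  # placeholder path

# persist() still returns immediately, handing back a collection backed
# by futures running on the cluster
ddf = ddf.persist()

# wait() blocks until every underlying future has finished (or errored),
# so nothing below runs until the data is fully in cluster memory
wait(ddf)

# Subsequent transformations and the Parquet write now operate on
# fully materialised data
ddf.to_parquet("s3://bucket/output/")  # placeholder path
```

You can also pass a timeout to wait() if you would rather fail fast than block indefinitely.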