Hi @mrchtr, welcome to Dask community!
I’m a bit surprised by this, it should not be the case! In Improving pipeline resilience when using `to_parquet` and preemptible workers, we identified that the task results from `to_parquet`
were kept in memory until the last write, but those are only the metadata needed for the final step, not the data to write. The dataset itself should be released from memory chunk by chunk as the writes occur.
Maybe there’s a catch in this workflow that I’m not seeing. Could you try to build a minimal reproducer?
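Something along these lines could be a starting point: a small local cluster writing a synthetic dataset, while you watch worker memory on the dashboard. The cluster settings, dataset size, and output path below are just placeholders to adapt to your case:

```python
# Hypothetical reproducer sketch: write a synthetic multi-partition dataset
# with to_parquet and observe whether worker memory stays flat during the write.
import dask
from distributed import Client

if __name__ == "__main__":
    # Small local cluster; memory_limit is a placeholder value.
    client = Client(n_workers=2, memory_limit="2GiB")

    # Synthetic dataset split into many partitions so writes happen incrementally.
    df = dask.datasets.timeseries(
        start="2000-01-01",
        end="2000-12-31",
        freq="1s",
        partition_freq="1d",
    )

    # If partitions are really released after each write, memory usage on the
    # dashboard should stay roughly flat during this call.
    df.to_parquet("reproducer-output/", write_metadata_file=True)
```

If memory keeps growing with this kind of script too, that would confirm the issue; if not, the difference is probably somewhere in your actual workflow.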