Hi @mrchtr, welcome to Dask community!
I’m a bit surprised by this, it should not be the case! In Improving pipeline resilience when using `to_parquet` and preemptible workers, we identified that the task results from `to_parquet`
were kept in memory until the last write, but those are only the metadata needed for the final step, not the data to write. The dataset itself should be released from memory chunk by chunk as the writes occur.
Maybe there’s a catch in this workflow that I’m not seeing. Could you try to build a minimal reproducer?
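Something along these lines could be a starting point: a small local cluster writing a synthetic dataset, while you watch worker memory on the dashboard. The cluster settings, dataset size, and output path below are just placeholders to adapt to your case:

```python
# Hypothetical reproducer sketch: write a synthetic multi-partition dataset
# with to_parquet and observe whether worker memory stays flat during the write.
import dask
from distributed import Client

if __name__ == "__main__":
    # Small local cluster; memory_limit is a placeholder value.
    client = Client(n_workers=2, memory_limit="2GiB")

    # Synthetic dataset split into many partitions so writes happen incrementally.
    df = dask.datasets.timeseries(
        start="2000-01-01",
        end="2000-12-31",
        freq="1s",
        partition_freq="1d",
    )

    # If partitions are really released after each write, memory usage on the
    # dashboard should stay roughly flat during this call.
    df.to_parquet("reproducer-output/", write_metadata_file=True)
```

If memory keeps growing with this kind of script too, that would confirm the issue; if not, the difference is probably somewhere in your actual workflow.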