Hi @LucaMarconato,
Just to be sure: are you talking about updating only part of the dataset, or is replacing the whole file enough? From the rest of your post, I assume replacing the whole file is OK.
As those file formats were designed to be immutable, the workflow I would suggest is close to the one you’re proposing: just write in another location and move the data once you are happy with the result.
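As a minimal sketch of what I mean (the paths and data below are just placeholders, and this assumes a local POSIX file system):

```python
import os
import shutil
import numpy as np
import zarr

final_path = "data.zarr"          # hypothetical destination store
staging_path = "data.zarr.tmp"    # hypothetical staging location on the same file system

# Write the new version of the dataset to the staging location.
new_data = np.random.random((1000, 1000))
zarr.save(staging_path, new_data)

# Once you are happy with the result, swap it in:
# remove the old store and move the staging directory into place.
if os.path.exists(final_path):
    shutil.rmtree(final_path)
shutil.move(staging_path, final_path)
```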
I’m not sure about Parquet, but I think there has been work on being able to append to, or even modify, just part of Zarr arrays; at the very least this should be feasible.
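For example, modifying only a region of an existing Zarr array in place already works through slice assignment, and arrays can be appended to (again, the path and shapes here are made up):

```python
import numpy as np
import zarr

# Open an existing array in read/write mode (path is a placeholder).
z = zarr.open("data.zarr", mode="r+")

# Overwrite only a region of the array; only the chunks touched
# by this slice are rewritten on disk.
z[0:100, 0:100] = np.zeros((100, 100))

# Appending along a dimension is also possible on Zarr arrays.
z.append(np.ones((10, z.shape[1])), axis=0)
```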
In the example you provide, I think you are replacing `z`.
How is this directory chosen? How is the user made aware of it?
Why do you say it’s designed for single-threaded use? That is a strong limitation for big data file formats.
In steps 4 and 5, copying the data can be expensive. A copy is unavoidable on object storage, but on POSIX file systems, moving the data should be considered instead.
Well, I don’t think there is a much better solution for safely overwriting Parquet and Zarr datasets. However, the solution is essentially: write somewhere else, then replace the dataset. I don’t think implementing this on the Dask side would be very useful, because the optimal way to do it depends on the storage infrastructure, the data volume, and probably other user-specific considerations.
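To make that concrete for the Parquet case, here is a rough sketch of the "write elsewhere, then swap" pattern with Dask on a POSIX file system (all paths are placeholders; the renames only work like this when both locations are on the same file system, and on object storage you would have to copy and then delete instead):

```python
import os
import shutil
import dask.dataframe as dd
import pandas as pd

final_path = "dataset.parquet"        # hypothetical current dataset (a directory of part files)
staging_path = "dataset.parquet.new"  # hypothetical staging directory on the same file system
backup_path = "dataset.parquet.old"   # hypothetical backup kept while swapping

# 1. Write the updated dataset to the staging directory.
df = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=4)
df.to_parquet(staging_path)

# 2. Swap the directories: renames are cheap on POSIX,
#    whereas on object storage this would be a full copy + delete.
if os.path.exists(final_path):
    os.rename(final_path, backup_path)
os.rename(staging_path, final_path)

# 3. Drop the old version once the swap has succeeded.
if os.path.exists(backup_path):
    shutil.rmtree(backup_path)
```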
cc @martindurant.