Hi @LucaMarconato,
Just to be sure: are you talking about updating only part of the dataset, or is replacing the whole file enough? From the rest of your post, I assume replacing the whole file is OK.
As those file formats were designed to be immutable, the workflow I would suggest is close to the one you’re proposing: just write in another location and move the data once you are happy with the result.
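As a minimal sketch of what I mean (the paths and data below are just placeholders, and this assumes a local POSIX file system):

```python
import os
import shutil
import numpy as np
import zarr

final_path = "data.zarr"          # hypothetical destination store
staging_path = "data.zarr.tmp"    # hypothetical staging location on the same file system

# Write the new version of the dataset to the staging location.
new_data = np.random.random((1000, 1000))
zarr.save(staging_path, new_data)

# Once you are happy with the result, swap it in:
# remove the old store and move the staging directory into place.
if os.path.exists(final_path):
    shutil.rmtree(final_path)
shutil.move(staging_path, final_path)
```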
I’m not sure about Parquet, but I think there has been work on being able to append to, or even modify, just part of Zarr arrays; at the very least this should be feasible.
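For example, modifying only a region of an existing Zarr array in place already works through slice assignment, and arrays can be appended to (again, the path and shapes here are made up):

```python
import numpy as np
import zarr

# Open an existing array in read/write mode (path is a placeholder).
z = zarr.open("data.zarr", mode="r+")

# Overwrite only a region of the array; only the chunks touched
# by this slice are rewritten on disk.
z[0:100, 0:100] = np.zeros((100, 100))

# Appending along a dimension is also possible on Zarr arrays.
z.append(np.ones((10, z.shape[1])), axis=0)
```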
In the example you provide, I think you are replacing `z`.
How is this directory chosen? How is the user made aware of it?
Why do you say it’s designed for single-threaded use? That is a strong limitation for big data file formats.
In steps 4 and 5, copying the data can be expensive. A copy is unavoidable on object storage, but on POSIX file systems, moving the data should be considered instead.
Well, I don’t think there is a much better solution for safely overwriting Parquet and Zarr datasets. However, the solution is essentially: write somewhere else, then replace the dataset. I don’t think implementing this on the Dask side would be very useful, because the optimal way to do it depends on the storage infrastructure, the data volume, and probably other user-specific considerations.
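To make that concrete for the Parquet case, here is a rough sketch of the "write elsewhere, then swap" pattern with Dask on a POSIX file system (all paths are placeholders; the renames only work like this when both locations are on the same file system, and on object storage you would have to copy and then delete instead):

```python
import os
import shutil
import dask.dataframe as dd
import pandas as pd

final_path = "dataset.parquet"        # hypothetical current dataset (a directory of part files)
staging_path = "dataset.parquet.new"  # hypothetical staging directory on the same file system
backup_path = "dataset.parquet.old"   # hypothetical backup kept while swapping

# 1. Write the updated dataset to the staging directory.
df = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=4)
df.to_parquet(staging_path)

# 2. Swap the directories: renames are cheap on POSIX,
#    whereas on object storage this would be a full copy + delete.
if os.path.exists(final_path):
    os.rename(final_path, backup_path)
os.rename(staging_path, final_path)

# 3. Drop the old version once the swap has succeeded.
if os.path.exists(backup_path):
    shutil.rmtree(backup_path)
```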
cc @martindurant.