I’m pretty new to Dask (a couple of days), so please bear with me… After looking into it, I think Dask fits my use case very well (I currently use multiprocessing, pandas and pyarrow).
I have a process which adds new records to a partitioned Parquet dataset using pyarrow. Since pyarrow has problems updating existing data in place, I version the records by writing a new file instead - this avoids the overhead of a read-update-write cycle. But sometimes the incoming data is very small, which can produce lots of very small files.
Dask’s Parquet writer has an append mode - how does it work? Does it do a read-update-write internally, or does it just append?
Append does not overwrite. New data becomes extra files (following the partitioning folder structure, if one exists) without reading the old ones - only the naming convention matters. If there is a _metadata file, it will be updated to include the new row-groups - that is indeed a rewrite, but of only that one file. Note that arrow (and dask) will not create a _metadata file unless explicitly requested.
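A minimal sketch of what that looks like in practice - the path, column names and the choice to keep a _metadata file are just assumptions for illustration:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical batch of new records (columns are made up for this example).
new_records = pd.DataFrame(
    {"year": [2024, 2024], "id": [101, 102], "value": [3.14, 2.72]}
)
ddf = dd.from_pandas(new_records, npartitions=1)

# append=True adds new files under the existing dataset instead of
# overwriting it; partition_on keeps the hive-style folder layout.
# write_metadata_file=True asks for a _metadata file, which is then the
# only file that gets rewritten on each append.
ddf.to_parquet(
    "data/records.parquet",       # assumed dataset path
    engine="pyarrow",
    append=True,
    partition_on=["year"],
    write_metadata_file=True,
)
```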
So the best option for me is to write an optimize method which I can run e.g. yearly, which combines the small files and rewrites them, possibly dropping the last level (i.e. the version) of the hive partitioning… Too many small files add considerable I/O overhead when querying the data. A rough sketch of such a pass is below.
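Something along these lines - the paths, the 100MB partition target and the swap-in step are assumptions, not a definitive implementation:

```python
import dask.dataframe as dd

# Hypothetical compaction pass: read the whole dataset, coalesce the many
# small files into fewer, larger partitions, and write a fresh copy.
ddf = dd.read_parquet("data/records.parquet", engine="pyarrow")

# Target partition size is an assumption; tune it to your query patterns.
compacted = ddf.repartition(partition_size="100MB")

compacted.to_parquet(
    "data/records_compacted.parquet",  # assumed staging path
    engine="pyarrow",
    partition_on=["year"],             # keep the hive layout, minus the version level
    write_metadata_file=True,
)
# After verifying the new copy, swap it in for the old directory.
```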