Quick Q on dask parquet append

I’m pretty new to Dask (a couple of days), so please bear with me… After evaluating Dask, I think it fits my use case very well (currently using multiprocessing, pandas, and pyarrow).

I have a process which adds new records to a partitioned parquet dataset using pyarrow. Since pyarrow has problems with updating existing data, I version the records by writing a new file instead, to avoid the overhead of a read-update-write cycle. But sometimes the new data is very small, which can produce very small files.
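
Roughly, the current write path looks like this (a simplified sketch; the dataset path and column names below are made up for illustration):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# New batch of records to persist (illustrative columns only)
new_records = pd.DataFrame(
    {"year": [2023, 2023], "version": [1, 1], "value": [1.5, 2.5]}
)

# write_to_dataset adds new files under the hive folder layout,
# e.g. year=2023/version=1/...
pq.write_to_dataset(
    pa.Table.from_pandas(new_records),
    root_path="data/store",              # hypothetical dataset root
    partition_cols=["year", "version"],
)
```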

Dask’s parquet writer has an append mode - how does it work? Does it do a read-update-write internally, or does it just append?

Hi @bozden, welcome to the Dask Discourse forum!!

I think if done correctly, append mode doesn’t need to overwrite the entire dataset, but I’m not entirely sure. @martindurant surely knows about this.

Append should not overwrite. New data will become extra files (following the partitioning folder structure, if it exists) without reading old ones - only the naming convention is needed. If there is a _metadata file, it will be updated to include the new row-groups - this is indeed a rewrite, but of only one file. arrow (and dask) will not create a _metadata file unless explicitly requested.
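
For reference (not from the original question), a minimal sketch of what append mode looks like from the Dask side; the dataset path, columns, and partition column are made up:

```python
import pandas as pd
import dask.dataframe as dd

# New batch of records to add to the existing dataset (illustrative columns)
new_records = pd.DataFrame({"year": [2023, 2023], "value": [1.5, 2.5]})
ddf = dd.from_pandas(new_records, npartitions=1)

ddf.to_parquet(
    "data/store",               # hypothetical existing dataset root
    engine="pyarrow",
    partition_on=["year"],      # follows the existing hive folder structure
    append=True,                # only writes new files; old ones are not read
    write_metadata_file=False,  # no _metadata file, so nothing gets rewritten
    ignore_divisions=True,      # needed if old and new index ranges overlap
)
```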


Thank you @martindurant & @guillaumeeb.

So in this respect dask is a wrapper, without changing the original pyarrow functionality.

Hmm. So I can adapt my old code more easily, but my small-file problem is not resolved 🙂


dask is a wrapper

Essentially - but each partition becomes a different file or set of files, and this parallelism is, after all, the point of dask.
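
To make that concrete (an assumed example, not from the thread): the number of files written into each partition folder tracks the number of Dask partitions, so repartitioning before the write is one way to control file sizes.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"year": [2022] * 100, "value": range(100)})

# 4 Dask partitions -> up to 4 parquet files inside year=2022/
dd.from_pandas(pdf, npartitions=4).to_parquet(
    "data/store_many", engine="pyarrow", partition_on=["year"]
)

# 1 Dask partition -> a single file per folder, which sidesteps the
# tiny-file problem at write time when a batch is small
dd.from_pandas(pdf, npartitions=1).to_parquet(
    "data/store_one", engine="pyarrow", partition_on=["year"]
)
```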


So, the best option for me: write an optimize method which I can run e.g. yearly, which combines the small files and re-creates them, possibly leaving out the last level (i.e. version) of the hive partitioning… Too many small files add quite a lot of I/O overhead when querying the data.
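
A hedged sketch of such an optimize/compaction pass (the paths, the remaining partition column, and the 128 MB target are assumptions, not anything prescribed in this thread):

```python
import dask.dataframe as dd

def compact(src_path: str, dst_path: str) -> None:
    """Read all the small files and rewrite them as fewer, larger ones."""
    ddf = dd.read_parquet(src_path, engine="pyarrow")

    # Coalesce the many small partitions into roughly 128 MB chunks
    ddf = ddf.repartition(partition_size="128MB")

    # Rewrite to a fresh location, keeping only the hive levels still wanted
    # (e.g. dropping the version level); swap the directories afterwards,
    # outside of Dask, once the write has succeeded.
    ddf.to_parquet(
        dst_path,
        engine="pyarrow",
        partition_on=["year"],      # hypothetical remaining hive level
        write_metadata_file=False,
        overwrite=True,
    )
```

Something like `compact("data/store", "data/store_compacted")` (hypothetical paths) could then run on a yearly schedule.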
