Quick Q on dask parquet append

I’m pretty new to Dask (a couple of days), so please bear with me… After evaluating Dask, I think it fits my use case very well (currently using multiprocessing, pandas, and pyarrow).

I have a process which adds new records to a partitioned parquet dataset using pyarrow. Since pyarrow has problems with updating existing data, I version the records by writing a new file instead, to avoid the overhead of a read-update-write cycle. But sometimes the new data is very small, which can produce very small files.
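
Roughly, the current write path looks like this (a simplified sketch; the dataset path and column names below are made up for illustration):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# New batch of records to persist (illustrative columns only)
new_records = pd.DataFrame(
    {"year": [2023, 2023], "version": [1, 1], "value": [1.5, 2.5]}
)

# write_to_dataset adds new files under the hive folder layout,
# e.g. year=2023/version=1/...
pq.write_to_dataset(
    pa.Table.from_pandas(new_records),
    root_path="data/store",              # hypothetical dataset root
    partition_cols=["year", "version"],
)
```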

Dask’s parquet writer has an append mode - how does it work? Does it do a read-update-write internally, or does it just append?

Hi @bozden, welcome to the Dask Discourse forum!!

I think if done correctly, append mode doesn’t need to overwrite the entire dataset, but I’m not entirely sure. @martindurant surely knows about this.

Append should not overwrite. New data will become extra files (following the partitioning folder structure, if it exists) without reading old ones - only the naming convention is needed. If there is a _metadata file, it will be updated to include the new row-groups - this is indeed a rewrite, but of only one file. arrow (and dask) will not create a _metadata file unless explicitly requested.
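
For reference (not from the original question), a minimal sketch of what append mode looks like from the Dask side; the dataset path, columns, and partition column are made up:

```python
import pandas as pd
import dask.dataframe as dd

# New batch of records to add to the existing dataset (illustrative columns)
new_records = pd.DataFrame({"year": [2023, 2023], "value": [1.5, 2.5]})
ddf = dd.from_pandas(new_records, npartitions=1)

ddf.to_parquet(
    "data/store",               # hypothetical existing dataset root
    engine="pyarrow",
    partition_on=["year"],      # follows the existing hive folder structure
    append=True,                # only writes new files; old ones are not read
    write_metadata_file=False,  # no _metadata file, so nothing gets rewritten
    ignore_divisions=True,      # needed if old and new index ranges overlap
)
```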


Thank you @martindurant & @guillaumeeb.

So in this respect dask is a wrapper, without changing the original pyarrow functionality.

Hmm. So I can adapt my old code more easily, but my small-file problem is not resolved 🙂


dask is a wrapper

Essentially - but each partition becomes a different file or set of files, and this parallelism is, after all, the point of dask.
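
To make that concrete (an assumed example, not from the thread): the number of files written into each partition folder tracks the number of Dask partitions, so repartitioning before the write is one way to control file sizes.

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"year": [2022] * 100, "value": range(100)})

# 4 Dask partitions -> up to 4 parquet files inside year=2022/
dd.from_pandas(pdf, npartitions=4).to_parquet(
    "data/store_many", engine="pyarrow", partition_on=["year"]
)

# 1 Dask partition -> a single file per folder, which sidesteps the
# tiny-file problem at write time when a batch is small
dd.from_pandas(pdf, npartitions=1).to_parquet(
    "data/store_one", engine="pyarrow", partition_on=["year"]
)
```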


So, the best option for me: write an optimize method which I can run e.g. yearly, which combines the small files and re-creates them, possibly leaving out the last level (i.e. version) of the hive partitioning… Too many small files add quite a lot of I/O overhead when querying the data.
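
A hedged sketch of such an optimize/compaction pass (the paths, the remaining partition column, and the 128 MB target are assumptions, not anything prescribed in this thread):

```python
import dask.dataframe as dd

def compact(src_path: str, dst_path: str) -> None:
    """Read all the small files and rewrite them as fewer, larger ones."""
    ddf = dd.read_parquet(src_path, engine="pyarrow")

    # Coalesce the many small partitions into roughly 128 MB chunks
    ddf = ddf.repartition(partition_size="128MB")

    # Rewrite to a fresh location, keeping only the hive levels still wanted
    # (e.g. dropping the version level); swap the directories afterwards,
    # outside of Dask, once the write has succeeded.
    ddf.to_parquet(
        dst_path,
        engine="pyarrow",
        partition_on=["year"],      # hypothetical remaining hive level
        write_metadata_file=False,
        overwrite=True,
    )
```

Something like `compact("data/store", "data/store_compacted")` (hypothetical paths) could then run on a yearly schedule.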
