Thank you @guillaumeeb, I had read that beforehand but have not experimented with it yet. So far I have converted several Python tasks without involving the Parquet format, just to be able to see the results.
In my first implementations it hardly used any CPU or RAM, as the work is mostly I/O bound (e.g. reading TBs of data from a mechanical drive, doing low-level analysis and writing the results out into .tsv files).
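For context, the converted tasks look roughly like this; it is only a sketch, the function body, file names and paths are placeholders for the real analysis:

```python
# Rough shape of the converted tasks (names/paths are placeholders): one lazy
# task per source file, each doing its own I/O and writing a small .tsv result.
from pathlib import Path
import dask
from dask import delayed

@delayed
def analyze_one(src: Path, dst: Path) -> Path:
    # stand-in for the real low-level analysis: count lines, write a tiny summary
    n_lines = sum(1 for _ in src.open("rb"))
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(f"file\tlines\n{src.name}\t{n_lines}\n")
    return dst

tasks = [
    analyze_one(src, Path("results") / src.name)
    for src in Path("data").rglob("*.tsv")
]
dask.compute(*tasks)  # workers mostly wait on the disk, so CPU/RAM stay low
```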
This is an open-source project which analyses voice-AI datasets (Mozilla Common Voice and others). Unfortunately the partitioning is somewhat pre-set: there are 120+ languages and new data comes out every 3 months. I partition it as <dataset>/<language>/<version> to prevent collisions, and the size of each partition depends on the amount of volunteered recordings. With each release some languages get many recordings and some get none, so it will be unbalanced by definition, a problem I could not resolve…
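To give an idea of the imbalance, this is roughly how I inspect it (ROOT and the layout walk are placeholders for my actual corpus root):

```python
# Total bytes per <dataset>/<language>/<version> directory, largest first,
# to show how unevenly the volunteered recordings are distributed.
from pathlib import Path

ROOT = Path("corpora")  # placeholder root directory

sizes = {
    str(version_dir.relative_to(ROOT)): sum(
        f.stat().st_size for f in version_dir.rglob("*") if f.is_file()
    )
    for dataset_dir in ROOT.iterdir() if dataset_dir.is_dir()
    for lang_dir in dataset_dir.iterdir() if lang_dir.is_dir()
    for version_dir in lang_dir.iterdir() if version_dir.is_dir()
}

for part, nbytes in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{part}\t{nbytes / 2**20:.1f} MiB")
```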
I think the 100-300 MiB in-memory partition size is a bit small for my purposes. For textual metadata and Arrow-converted audio (which can compress to 10-20% of the original size), that would create too many small files.
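What I was planning to try for the metadata part is simply aiming for larger partitions when reading; a minimal sketch, with placeholder globs and sizes:

```python
# Read the .tsv metadata with a larger blocksize so each partition holds more
# rows than the 100-300 MiB guidance would give; glob/sizes are placeholders.
import dask.dataframe as dd

ddf = dd.read_csv(
    "data/*/*/*/*.tsv",   # <dataset>/<language>/<version>/*.tsv
    sep="\t",
    blocksize="1GiB",     # placeholder target partition size
    dtype=str,            # keep everything as text for the metadata pass
)

# If partitions still come out small or unbalanced, coarsen them explicitly
# (note: this has to inspect partition sizes, so it is not free).
ddf = ddf.repartition(partition_size="1GiB")
```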
Is the Dask overhead the document talks about related to network communication or IPC? I don't plan to go to the cloud, but I definitely want to scale it out over a local LAN.