What is the best method you use to create parquet part files limited in file size (e.g. 1 GB)?
I currently use multiprocessing + pyarrow: from a sample I estimate a statistical compression ratio, use it to calculate the maximum number of records per file, and write the files as batch_nn_part_mm.parquet.
As a Dask newbie, I wonder whether there is a safer and/or better way to do this in Dask.
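For context, a minimal sketch of my current approach (the helper names, the snappy choice and the 1 GB target are placeholders, not my exact code):

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq

TARGET_FILE_SIZE = 1 * 1024**3  # ~1 GB per part file (my current target)


def estimate_rows_per_file(sample: pa.Table) -> int:
    """Estimate how many records fit into TARGET_FILE_SIZE after compression,
    based on the compressed size of a sample written to an in-memory buffer."""
    buf = io.BytesIO()
    pq.write_table(sample, buf, compression="snappy")
    bytes_per_row = buf.getbuffer().nbytes / sample.num_rows
    return max(1, int(TARGET_FILE_SIZE / bytes_per_row))


def write_parts(table: pa.Table, batch_no: int, rows_per_file: int) -> None:
    """Slice the table into rows_per_file chunks and write batch_nn_part_mm.parquet."""
    for part_no, start in enumerate(range(0, table.num_rows, rows_per_file)):
        chunk = table.slice(start, rows_per_file)
        pq.write_table(chunk, f"batch_{batch_no:02d}_part_{part_no:02d}.parquet")
```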
Dask will write one file per Dask dataframe partition to this directory. To optimize access for downstream consumers, we recommend aiming for an in-memory size of 100-300 MiB per partition. This helps balance worker memory usage against Dask overhead. You may find the DataFrame.memory_usage_per_partition() method useful for determining if your data is partitioned optimally.
So the idea is to partition your dataset correctly.
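For example, a minimal sketch (the input path, the `deep=True` flag and the 256 MiB target are arbitrary choices, not recommendations from the docs):

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.tsv", sep="\t")  # hypothetical source files

# Inspect the in-memory size of each partition (in bytes)
print(ddf.memory_usage_per_partition(deep=True).compute())

# Aim for roughly 256 MiB per partition, then write one file per partition
ddf = ddf.repartition(partition_size="256MiB")
ddf.to_parquet("output/", compression="snappy")
```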
Thank you @guillaumeeb, I had read that beforehand but have not experimented with it yet. So far, I converted several Python tasks that do not involve the Parquet format, so that I could check the results.
In my first implementations the code hardly used any CPU or RAM, as it is mostly IO bound (e.g. reading TBs of data from a mechanical drive, doing low-level analysis, and writing the results out to .tsv files).
This is an open-source project which analyses voice-AI datasets (Mozilla Common Voice and others). Unfortunately the partitioning is somewhat pre-set: there are 120+ languages and new data comes out every 3 months. I partition the data as <dataset>/<language>/<version> to prevent collisions, and the size depends on the amount of volunteered recordings. With each release some languages get many new recordings and some get none, so the dataset is unbalanced by definition - a problem I could not resolve…
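If I let Dask drive that layout, I suppose something like `partition_on` could work (a sketch only; the column names are my assumption, and `partition_on` writes hive-style directories such as dataset=cv/language=en/version=v17 rather than my plain <dataset>/<language>/<version> names):

```python
import dask.dataframe as dd

# Hypothetical metadata frame with one row per recording,
# carrying dataset, language and version columns
ddf = dd.read_parquet("metadata/")

# Dask writes hive-style subdirectories,
# e.g. out/dataset=cv/language=en/version=v17/part.0.parquet
ddf.to_parquet(
    "out/",
    partition_on=["dataset", "language", "version"],
    compression="snappy",
)
```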
I think a 100-300 MiB in-memory size is a bit small for my purposes. For textual metadata and Arrow-converted audio (which can compress to 10-20% of its in-memory size), that would create too many small files.
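A quick back-of-envelope check of why (using the optimistic end of my observed 10-20% compression ratio):

```python
# Illustrative figures only, based on my observed ~10-20% Parquet compression ratio
in_memory_partition = 300 * 2**20      # 300 MiB, upper end of the guideline
compression_ratio = 0.10               # file on disk ~ 10% of in-memory size
print((in_memory_partition * compression_ratio) / 2**20)  # ~30 MiB files -> many small files

target_file = 1 * 2**30                # the ~1 GB files I originally aimed for
print((target_file / compression_ratio) / 2**30)          # ~10 GiB in memory per partition
```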
Is the Dask overhead the documentation talks about related to network communication or IPC? I don't plan to go to the cloud, but I definitely want to scale to a local LAN.
Dask overhead is mentioned in several parts of the documentation, like here, or here. The overhead is mainly task scheduling overhead, but network can also be involved of course.
Recommendations are a starting point, but of course they must be adapted to your use case. You can definitely aim at bigger partitions!!
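To illustrate the trade-off (the 1 GiB figure is arbitrary): fewer, bigger partitions mean fewer tasks per operation and hence less scheduling overhead, at the cost of more memory per worker.

```python
import dask.dataframe as dd

ddf = dd.read_parquet("output/")       # hypothetical dataset from above
print(ddf.npartitions)                 # roughly one task per partition per operation

bigger = ddf.repartition(partition_size="1GiB")
print(bigger.npartitions)              # fewer tasks, larger memory footprint per worker
```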
Oh, OK, that's fine, I already had my own custom scheduler (now removed).
I usually have long-running tasks, currently ~1500 tasks at most. My task graphs are fully blue.