What is the best method you use to create parquet part files limited in file size (e.g. 1 GB)?
I currently use multiprocessing + pyarrow: from a sample I estimate a statistical compression ratio, use it to calculate the maximum number of records per file, and write the files as batch_nn_part_mm.parquet.
As a Dask newbie, I wonder whether there is a safer and/or better way to do this in Dask.
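For context, a minimal sketch of my current approach (the helper names, the snappy choice and the 1 GB target are placeholders, not my exact code):

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq

TARGET_FILE_SIZE = 1 * 1024**3  # ~1 GB per part file (my current target)


def estimate_rows_per_file(sample: pa.Table) -> int:
    """Estimate how many records fit into TARGET_FILE_SIZE after compression,
    based on the compressed size of a sample written to an in-memory buffer."""
    buf = io.BytesIO()
    pq.write_table(sample, buf, compression="snappy")
    bytes_per_row = buf.getbuffer().nbytes / sample.num_rows
    return max(1, int(TARGET_FILE_SIZE / bytes_per_row))


def write_parts(table: pa.Table, batch_no: int, rows_per_file: int) -> None:
    """Slice the table into rows_per_file chunks and write batch_nn_part_mm.parquet."""
    for part_no, start in enumerate(range(0, table.num_rows, rows_per_file)):
        chunk = table.slice(start, rows_per_file)
        pq.write_table(chunk, f"batch_{batch_no:02d}_part_{part_no:02d}.parquet")
```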
Dask will write one file per Dask dataframe partition to this directory. To optimize access for downstream consumers, we recommend aiming for an in-memory size of 100-300 MiB per partition. This helps balance worker memory usage against Dask overhead. You may find the DataFrame.memory_usage_per_partition() method useful for determining if your data is partitioned optimally.
So the idea is to partition your dataset correctly.
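For example, a minimal sketch (the input path, the `deep=True` flag and the 256 MiB target are arbitrary choices, not recommendations from the docs):

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.tsv", sep="\t")  # hypothetical source files

# Inspect the in-memory size of each partition (in bytes)
print(ddf.memory_usage_per_partition(deep=True).compute())

# Aim for roughly 256 MiB per partition, then write one file per partition
ddf = ddf.repartition(partition_size="256MiB")
ddf.to_parquet("output/", compression="snappy")
```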
Thank you @guillaumeeb, I had read that beforehand but have not experimented with it yet. So far, I converted several Python tasks that do not involve the Parquet format, so that I could check the results.
In my first implementations the code hardly used any CPU or RAM, as it is mostly IO bound (e.g. reading TBs of data from a mechanical drive, doing low-level analysis, and writing the results out to .tsv files).
This is an open-source project which analyses voice-AI datasets (Mozilla Common Voice and others). Unfortunately the partitioning is somewhat pre-set: there are 120+ languages and new data comes out every 3 months. I partition the data as <dataset>/<language>/<version> to prevent collisions, and the size depends on the amount of volunteered recordings. With each release some languages get many new recordings and some get none, so the dataset is unbalanced by definition - a problem I could not resolve…
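If I let Dask drive that layout, I suppose something like `partition_on` could work (a sketch only; the column names are my assumption, and `partition_on` writes hive-style directories such as dataset=cv/language=en/version=v17 rather than my plain <dataset>/<language>/<version> names):

```python
import dask.dataframe as dd

# Hypothetical metadata frame with one row per recording,
# carrying dataset, language and version columns
ddf = dd.read_parquet("metadata/")

# Dask writes hive-style subdirectories,
# e.g. out/dataset=cv/language=en/version=v17/part.0.parquet
ddf.to_parquet(
    "out/",
    partition_on=["dataset", "language", "version"],
    compression="snappy",
)
```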
I think a 100-300 MiB in-memory size is a bit small for my purposes. For textual metadata and Arrow-converted audio (which can compress to 10-20% of its in-memory size), that would create too many small files.
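A quick back-of-envelope check of why (using the optimistic end of my observed 10-20% compression ratio):

```python
# Illustrative figures only, based on my observed ~10-20% Parquet compression ratio
in_memory_partition = 300 * 2**20      # 300 MiB, upper end of the guideline
compression_ratio = 0.10               # file on disk ~ 10% of in-memory size
print((in_memory_partition * compression_ratio) / 2**20)  # ~30 MiB files -> many small files

target_file = 1 * 2**30                # the ~1 GB files I originally aimed for
print((target_file / compression_ratio) / 2**30)          # ~10 GiB in memory per partition
```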
Is the Dask overhead the documentation talks about related to network communication or IPC? I don't plan to go to the cloud, but I definitely want to scale to a local LAN.
Dask overhead is mentioned in several parts of the documentation, like here, or here. The overhead is mainly task scheduling overhead, but network can also be involved of course.
Recommendations are a starting point, but of course they must be adapted to your use case. You can definitely aim at bigger partitions!!
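To illustrate the trade-off (the 1 GiB figure is arbitrary): fewer, bigger partitions mean fewer tasks per operation and hence less scheduling overhead, at the cost of more memory per worker.

```python
import dask.dataframe as dd

ddf = dd.read_parquet("output/")       # hypothetical dataset from above
print(ddf.npartitions)                 # roughly one task per partition per operation

bigger = ddf.repartition(partition_size="1GiB")
print(bigger.npartitions)              # fewer tasks, larger memory footprint per worker
```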
Oh, OK, that's fine, I already had my own custom scheduler (now removed).
I usually have long-running tasks, currently ~1500 tasks at most. My task graphs are fully blue.