Thank you @guillaumeeb, I had read that beforehand but have not experimented with it yet. So far I have converted several Python tasks without involving the Parquet format, just to be able to see the results.
In my first implementations it hardly used any CPU or RAM, as the work is mostly I/O bound (e.g. reading TBs of data from a mechanical drive, doing low-level analysis and writing the results out into .tsv files).
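For context, the converted tasks look roughly like this; it is only a sketch, the function body, file names and paths are placeholders for the real analysis:

```python
# Rough shape of the converted tasks (names/paths are placeholders): one lazy
# task per source file, each doing its own I/O and writing a small .tsv result.
from pathlib import Path
import dask
from dask import delayed

@delayed
def analyze_one(src: Path, dst: Path) -> Path:
    # stand-in for the real low-level analysis: count lines, write a tiny summary
    n_lines = sum(1 for _ in src.open("rb"))
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(f"file\tlines\n{src.name}\t{n_lines}\n")
    return dst

tasks = [
    analyze_one(src, Path("results") / src.name)
    for src in Path("data").rglob("*.tsv")
]
dask.compute(*tasks)  # workers mostly wait on the disk, so CPU/RAM stay low
```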
This is an open-source project which analyses voice-AI datasets (Mozilla Common Voice and others). Unfortunately the partitioning is somewhat pre-set: there are 120+ languages and new data comes out every 3 months. I partition it as <dataset>/<language>/<version> to prevent collisions, and the size of each partition depends on the amount of volunteered recordings. With each release some languages get many recordings and some get none, so it will be unbalanced by definition, a problem I could not resolve…
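To give an idea of the imbalance, this is roughly how I inspect it (ROOT and the layout walk are placeholders for my actual corpus root):

```python
# Total bytes per <dataset>/<language>/<version> directory, largest first,
# to show how unevenly the volunteered recordings are distributed.
from pathlib import Path

ROOT = Path("corpora")  # placeholder root directory

sizes = {
    str(version_dir.relative_to(ROOT)): sum(
        f.stat().st_size for f in version_dir.rglob("*") if f.is_file()
    )
    for dataset_dir in ROOT.iterdir() if dataset_dir.is_dir()
    for lang_dir in dataset_dir.iterdir() if lang_dir.is_dir()
    for version_dir in lang_dir.iterdir() if version_dir.is_dir()
}

for part, nbytes in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{part}\t{nbytes / 2**20:.1f} MiB")
```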
I think the 100-300 MiB in-memory partition size is a bit small for my purposes. For textual metadata and Arrow-converted audio (which can compress to 10-20% of the original size), that would create too many small files.
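What I was planning to try for the metadata part is simply aiming for larger partitions when reading; a minimal sketch, with placeholder globs and sizes:

```python
# Read the .tsv metadata with a larger blocksize so each partition holds more
# rows than the 100-300 MiB guidance would give; glob/sizes are placeholders.
import dask.dataframe as dd

ddf = dd.read_csv(
    "data/*/*/*/*.tsv",   # <dataset>/<language>/<version>/*.tsv
    sep="\t",
    blocksize="1GiB",     # placeholder target partition size
    dtype=str,            # keep everything as text for the metadata pass
)

# If partitions still come out small or unbalanced, coarsen them explicitly
# (note: this has to inspect partition sizes, so it is not free).
ddf = ddf.repartition(partition_size="1GiB")
```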
Is the Dask overhead the document talks about related to network communication or IPC? I don't plan to go to the cloud, but I definitely want to scale it out over a local LAN.