I’m posting my bug report here (Local Cluster Unable to Handle Smaller-than-Memory Parquet File · Issue #8383 · dask/distributed · GitHub) because it might just be a matter of “how do I better organize my data”.
My parquet file (s3://ari-public-test-data/test1) is 266 MB on disk and 1017 MB in memory (according to memory_usage()). It has 32 row groups of about 10 MB each.
I’d love some advice on how to set this up better. I’m ultimately trying to handle larger-than-memory datasets, so I want to get this working at a small scale first.
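For reference, here is a minimal sketch of one way I could read the file with one partition per row group on a small LocalCluster. The worker count, memory limit, and anonymous `storage_options` are just placeholder assumptions, not my exact setup:

```python
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

if __name__ == "__main__":
    # Deliberately small workers to mimic a memory-constrained local setup
    # (these numbers are assumptions, not the configuration from the issue).
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="2GiB")
    client = Client(cluster)

    # split_row_groups=True maps each of the 32 row groups to its own
    # partition, instead of materializing the whole ~1 GB frame in one task.
    df = dd.read_parquet(
        "s3://ari-public-test-data/test1",
        split_row_groups=True,
        storage_options={"anon": True},  # assumption: the bucket is publicly readable
    )

    print(df.npartitions)
    print(df.memory_usage(deep=True).sum().compute())
```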