Local cluster unable to handle larger-than-memory parquet file

I’m posting my bug report here (Local Cluster Unable to Handle Smaller-than-Memory Parquet File · Issue #8383 · dask/distributed · GitHub) because it might just be a matter of “how do I better organize my data”.

My parquet file (s3://ari-public-test-data/test1) is 266 MB on disk and 1017 MB in memory (according to memory_usage()). It’s got 32 row groups of 10 MB each.
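For reference, this is roughly how I checked those numbers (a minimal sketch assuming a local copy of the file named `test1.parquet`; the path is a placeholder, not the real one):

```python
import pandas as pd
import pyarrow.parquet as pq

path = "test1.parquet"  # hypothetical local copy of the dataset

# Row-group layout according to the Parquet metadata.
meta = pq.ParquetFile(path).metadata
print(f"{meta.num_row_groups} row groups, {meta.num_rows} rows total")

# In-memory footprint once fully materialised as a pandas DataFrame.
df = pd.read_parquet(path)
print(f"{df.memory_usage(deep=True).sum() / 1e6:.0f} MB in memory")
```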

I’d love some advice on how to better set this up. I’m ultimately trying to handle larger-than-memory datasets, so I am trying to figure it out at a small scale first.
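The setup looks roughly like the sketch below (not my exact reproducer, just the general shape of it; the worker count, memory limit, and anonymous S3 access are illustrative assumptions):

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Illustrative local cluster; memory_limit applies per worker,
# so four workers give ~8 GB of cluster memory in total.
cluster = LocalCluster(n_workers=4, memory_limit="2GB")
client = Client(cluster)

# Lazily read the dataset; anonymous access is assumed for the public bucket.
ddf = dd.read_parquet(
    "s3://ari-public-test-data/test1",
    storage_options={"anon": True},
)
print(ddf.npartitions)

# A simple aggregation that should run out-of-core,
# holding only partial results in memory at any one time.
print(ddf.memory_usage(deep=True).sum().compute())
```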

Hi @seydar, welcome to the Dask community!

I see that you’ve already had several answers from @fjetter on GitHub. Could you be a bit more specific about what kind of answer you’re looking for here? Do you want advice on how to write the Parquet file to fit your use case? If so, what is your use case, and what computations do you intend to run on this dataset? Does it involve grouping or sorting?

Could you also post the reproducer here, so it is easier to follow?