I’m posting my bug report here (Local Cluster Unable to Handle Smaller-than-Memory Parquet File · Issue #8383 · dask/distributed · GitHub) because it might just be a matter of “how do I better organize my data”.
My parquet file (s3://ari-public-test-data/test1) is 266 MB on disk and 1017 MB in memory (according to memory_usage()). It has 32 row groups of about 10 MB each.
I’d love some advice on how to set this up better. I’m ultimately trying to handle larger-than-memory datasets, so I want to get this working at a small scale first.
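For reference, here is a minimal sketch of one way I could read the file with one partition per row group on a small LocalCluster. The worker count, memory limit, and anonymous `storage_options` are just placeholder assumptions, not my exact setup:

```python
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

if __name__ == "__main__":
    # Deliberately small workers to mimic a memory-constrained local setup
    # (these numbers are assumptions, not the configuration from the issue).
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, memory_limit="2GiB")
    client = Client(cluster)

    # split_row_groups=True maps each of the 32 row groups to its own
    # partition, instead of materializing the whole ~1 GB frame in one task.
    df = dd.read_parquet(
        "s3://ari-public-test-data/test1",
        split_row_groups=True,
        storage_options={"anon": True},  # assumption: the bucket is publicly readable
    )

    print(df.npartitions)
    print(df.memory_usage(deep=True).sum().compute())
```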