I'm having trouble with Dask on a problem that seems easy and that DuckDB handles without issue.
I have a single Linux machine with 32 GB of RAM for local execution, and about 10 GB of Parquet files that all share the same schema of roughly 10 columns. I need to group by two columns and sum a third, so a typical group-by scenario. Note that since I only use 3 of the ~10 columns, only a subset of the 10 GB of Parquet data is actually needed for this computation.
With Dask, RAM fills up progressively and after a few minutes all memory is exhausted and the process crashes. If I first initialise a LocalCluster, I see INFO logs saying the "event loop was unresponsive in Nanny [...] often caused by long-running GIL-holding functions or moving large chunks of data".
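Roughly what I'm running is the sketch below; the column names and the path are placeholders for my real data:

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Placeholder path and column names, standing in for my real schema.
cluster = LocalCluster()      # defaults: one worker process per core
client = Client(cluster)

ddf = dd.read_parquet(
    "data/*.parquet",
    columns=["key_a", "key_b", "value"],  # only the 3 columns the aggregation needs
)

# Group by two columns and sum the third, then bring the result back locally.
result = (
    ddf.groupby(["key_a", "key_b"])["value"]
       .sum()
       .compute()
)
```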
With DuckDB, the same aggregation completes in a few seconds with minimal RAM usage, which is what I would have expected. The code is minimal as well, and this was my first time using DuckDB.
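For comparison, the DuckDB version is along these lines (same placeholder path and column names as above):

```python
import duckdb

# Same placeholder column names and glob as in the Dask sketch.
con = duckdb.connect()
result = con.execute(
    """
    SELECT key_a, key_b, SUM(value) AS value_sum
    FROM read_parquet('data/*.parquet')
    GROUP BY key_a, key_b
    """
).df()
```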
What could I be missing?