Poor performance with Parquet data vs DuckDB

I'm having trouble with Dask on a problem that seems easy and that DuckDB handles without any issue.

I'm running locally on a single Linux machine with 32GB of RAM. The data is 10GB of Parquet files that all share the same schema of about 10 columns. I need to group by two columns and sum a third - a typical group-by scenario. Note that since only 3 columns are involved, just a subset of the 10GB of Parquet data is actually needed for this computation.
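Roughly, the Dask side looks like this - a simplified sketch from memory, where the file path and column names are placeholders rather than my actual schema:

```python
import dask.dataframe as dd

# Read only the three columns needed for the aggregation
ddf = dd.read_parquet(
    "data/*.parquet",
    columns=["key_a", "key_b", "value"],  # placeholder names
)

# Group by two columns and sum the third
result = ddf.groupby(["key_a", "key_b"])["value"].sum().compute()
```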

With Dask, RAM fills up progressively, and after a few minutes all of it is used and the process crashes. If I first initialise a LocalCluster, I see INFO logs saying the “event loop was unresponsive in Nanny […] often caused by long-running GIL-holding functions or moving large chunks of data”.
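The cluster setup is nothing special, roughly the following (the worker count and memory limit here are example values, not my exact settings):

```python
from dask.distributed import Client, LocalCluster

# Example local cluster; the warning above shows up in these workers' logs
cluster = LocalCluster(n_workers=4, memory_limit="6GB")  # example values
client = Client(cluster)
```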

With DuckDB, the same aggregation completes in a few seconds with minimal RAM used, which is what I would have expected. The code is minimal, and this is my first time using DuckDB.
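The DuckDB version is roughly this (again a sketch with placeholder names):

```python
import duckdb

con = duckdb.connect()
result = con.execute(
    """
    SELECT key_a, key_b, SUM(value) AS total
    FROM read_parquet('data/*.parquet')
    GROUP BY key_a, key_b
    """
).df()
```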

What could I be missing?

Hi @gugat, welcome to the Dask community!

It would be much easier to answer with a minimal reproducible example - would you be able to provide one? At the very least, please share the actual code you ran with both Dask and DuckDB.

That said, given your numbers, did you try with Pandas alone? What happens there? It is not necessarily surprising that DuckDB does better in this case, but you still shouldn't run into any RAM issues with Dask - see the sketch below.
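For what it's worth, here is a rough sketch of what I would expect to fit comfortably in 32GB (paths and column names are made up, so adapt them to your data): read only the three needed columns, and if the grouped result is large, spread it over several output partitions with `split_out`. The Pandas lines are just the single-machine baseline you could compare against.

```python
import dask.dataframe as dd
import pandas as pd

cols = ["key_a", "key_b", "value"]  # placeholder column names

# Dask: read only the needed columns, keep the aggregation output partitioned
ddf = dd.read_parquet("data/*.parquet", columns=cols)
agg = ddf.groupby(["key_a", "key_b"])["value"].sum(split_out=8)
print(agg.compute())

# Pandas baseline for comparison, also reading only the needed columns
pdf = pd.read_parquet("data/", columns=cols)
print(pdf.groupby(["key_a", "key_b"])["value"].sum())
```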