Poor performance with Parquet data vs DuckDB

I'm having trouble with Dask on a problem that seems easy and that DuckDB handles without any issue.

I'm running locally on a single Linux machine with 32GB of RAM. The data is 10GB of Parquet files that all share the same schema of about 10 columns. I need to group by two columns and sum a third - a typical group-by scenario. Note that since only 3 columns are involved, just a subset of the 10GB of Parquet data is actually needed for this computation.
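Roughly, the Dask side looks like this - a simplified sketch from memory, where the file path and column names are placeholders rather than my actual schema:

```python
import dask.dataframe as dd

# Read only the three columns needed for the aggregation
ddf = dd.read_parquet(
    "data/*.parquet",
    columns=["key_a", "key_b", "value"],  # placeholder names
)

# Group by two columns and sum the third
result = ddf.groupby(["key_a", "key_b"])["value"].sum().compute()
```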

With Dask, RAM fills up progressively, and after a few minutes all of it is used and the process crashes. If I first initialise a LocalCluster, I see INFO logs saying the “event loop was unresponsive in Nanny […] often caused by long-running GIL-holding functions or moving large chunks of data”.
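The cluster setup is nothing special, roughly the following (the worker count and memory limit here are example values, not my exact settings):

```python
from dask.distributed import Client, LocalCluster

# Example local cluster; the warning above shows up in these workers' logs
cluster = LocalCluster(n_workers=4, memory_limit="6GB")  # example values
client = Client(cluster)
```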

With DuckDB, the same aggregation completes in a few seconds with minimal RAM used, which is what I would have expected. The code is minimal, and this is my first time using DuckDB.
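The DuckDB version is roughly this (again a sketch with placeholder names):

```python
import duckdb

con = duckdb.connect()
result = con.execute(
    """
    SELECT key_a, key_b, SUM(value) AS total
    FROM read_parquet('data/*.parquet')
    GROUP BY key_a, key_b
    """
).df()
```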

What could I be missing?

Hi @gugat, welcome to the Dask community!

It would be much easier to answer with a minimal reproducible example - would you be able to provide one? At the very least, please share the actual code you ran with both Dask and DuckDB.

That said, given your numbers, did you try with Pandas alone? What happens there? It is not necessarily surprising that DuckDB does better in this case, but you still shouldn't run into any RAM issues with Dask - see the sketch below.
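For what it's worth, here is a rough sketch of what I would expect to fit comfortably in 32GB (paths and column names are made up, so adapt them to your data): read only the three needed columns, and if the grouped result is large, spread it over several output partitions with `split_out`. The Pandas lines are just the single-machine baseline you could compare against.

```python
import dask.dataframe as dd
import pandas as pd

cols = ["key_a", "key_b", "value"]  # placeholder column names

# Dask: read only the needed columns, keep the aggregation output partitioned
ddf = dd.read_parquet("data/*.parquet", columns=cols)
agg = ddf.groupby(["key_a", "key_b"])["value"].sum(split_out=8)
print(agg.compute())

# Pandas baseline for comparison, also reading only the needed columns
pdf = pd.read_parquet("data/", columns=cols)
print(pdf.groupby(["key_a", "key_b"])["value"].sum())
```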