Why does dask take long time to compute regardless of the size of dataframe and partitions

pavithraes · March 31, 2022, 5:56pm

@jhanv Welcome to Discourse!

Just to note, I see you asked this on Stack Overflow as well.

Dask doesn’t know the full shape of your DataFrame ahead of time. It knows the column names and the number of “partitions”, but not how many rows each partition holds. So the number of rows in each partition is only calculated when you call .compute(), which does take some time. This is one of the reasons why it’s recommended to call .compute() or .persist() as few times as possible, towards the end of your workflow.

Here’s an example task graph:

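import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

# Start a local cluster (optional for this small example)
client = Client()

df = pd.DataFrame({
    'a': range(1000),
    'b': range(1000),
})
ddf = dd.from_pandas(df, npartitions=2)

# The row count is lazy: this draws the task graph needed to compute it
# (visualize() needs the optional graphviz dependency)
ddf.shape[0].visualize()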

Let me know if this helps!
