Why does Dask take a long time to compute regardless of the size of the DataFrame and partitions?

@jhanv Welcome to Discourse!

Just to note, I see you asked this on Stack Overflow as well.

Dask doesn’t know the shape of your DataFrame up front. It only knows the number of “partitions”, not how many rows each one holds. So, the number of rows in each partition is calculated when you call .compute(), which does take some time. This is one of the reasons why it’s recommended to call .compute() or .persist() sparingly, ideally towards the end of your workflow.

Here’s an example task graph:

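import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

# Start a local cluster (optional for this small example)
client = Client()

df = pd.DataFrame({
    'a': range(1000),
    'b': range(1000),
})
ddf = dd.from_pandas(df, npartitions=2)

# The row count is lazy; draw the task graph needed to compute it
# (visualize() requires graphviz to be installed)
ddf.shape[0].visualize()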

Let me know if this helps!