@jhanv Welcome to Discourse!
Just to note, I see you asked this on Stack Overflow as well.
Dask isn’t aware of the shape of your DataFrame. In fact, it just knows the number of “partitions”. So, the number of elements in each partition are calculated when you call .compute(), which does take some time. This is one of the reasons why it’s recommended to call .compute() or .persist() very minimally, towards to end.
Here’s an example task graph:
import pandas as pd
import dask.dataframe as dd
from distributed import Client
client = Client()
df = pd.DataFrame({
'a': range(1000),
'b': range(1000),
})
ddf = dd.from_pandas(df, npartitions=2)
ddf.shape[0].visualize()
Let me know if this helps!
