pivot_table doesn't work the same as in pandas

jadeidev · May 22, 2023, 10:31pm

Consider the following pandas DataFrame:

import pandas as pd
import dask.dataframe as dd
import numpy as np

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
        "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
        "C": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    }
)

Running the following:

df.pivot_table(values="C", index=["A", "B"], aggfunc=np.median)

gives:

	A	B	C
0	bar	one	4.5
1	bar	two	6.5
2	foo	one	2.0
3	foo	two	3.0

This is the required result. However, the same call fails on a Dask DataFrame:

ddf = dd.from_pandas(df, npartitions=3)
ddf.pivot_table(values="C", index=["A", "B"], aggfunc=np.median)

This raises:
ValueError: 'index' must be the name of an existing column

It seems the Dask DataFrame implementation only accepts a single (scalar) column name for index; see dask.dataframe.reshape.pivot_table in the Dask documentation.
Is there another way to achieve this?

guillaumeeb · May 30, 2023, 12:55pm

Hi @jadeidev,

This is not exactly the same, since you’ll get a Series instead of a DataFrame, but you can get the same values with:

res = ddf.groupby(["A", "B"]).C.median()
# Optional, depends on what you want to do: compute() returns a pandas Series
pd_series = res.compute()

Does that help?
