pivot_table doesn't work the same as in pandas

Consider the following pandas DataFrame:

import pandas as pd
import dask.dataframe as dd
import numpy as np

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
        "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
        "C": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    }
)

Running the following:

df.pivot_table(values="C", index=["A", "B"], aggfunc=np.median)

gives:

A B C
0 bar one 4.5
1 bar two 6.5
2 foo one 2.0
3 foo two 3.0

This is the required result. However, the same call fails on a Dask DataFrame:

ddf = dd.from_pandas(df, npartitions=3)
ddf.pivot_table(values="C", index=["A", "B"], aggfunc=np.median)

which raises:
ValueError: 'index' must be the name of an existing column

It seems the Dask implementation only accepts a scalar for index (see dask.dataframe.reshape.pivot_table in the Dask documentation).
Is there another way to achieve this?

Hi @jadeidev,

This isn't exactly the same, since you'll get a Series instead of a DataFrame, but you can get the same values with:

res = ddf.groupby(["A", "B"]).C.median()
# Optional, depends on what you want to do
pd_series = res.compute()
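
If you need the DataFrame shape rather than a Series, the computed result is a regular pandas Series, so it can be converted afterwards. Here is a minimal sketch of that step. To keep it runnable without Dask, it builds the Series with a plain pandas groupby as a stand-in for the computed result. The names as_frame and as_flat are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
        "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
        "C": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    }
)

# Stand-in for the computed Dask result (assumption: with Dask this Series
# would come from ddf.groupby(["A", "B"]).C.median().compute()).
pd_series = df.groupby(["A", "B"])["C"].median()

# One-column DataFrame indexed by (A, B).
as_frame = pd_series.to_frame()

# Flat layout with A and B as ordinary columns.
as_flat = pd_series.reset_index()

print(as_flat)
```

Which of the two layouts is more convenient depends on what you do with the result next.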

Does that help?