Consider the following pandas DataFrame:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
        "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
        "C": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    }
)
Running the following:
df.pivot_table(values="C", index=["A", "B"], aggfunc=np.median)
gives:
| | A | B | C |
|---|---|---|---|
| 0 | bar | one | 4.5 |
| 1 | bar | two | 6.5 |
| 2 | foo | one | 2.0 |
| 3 | foo | two | 3.0 |
This is the required result. However, the same call on a Dask DataFrame fails:
ddf = dd.from_pandas(df, npartitions=3)
ddf.pivot_table(values="C", index=["A", "B"], aggfunc=np.median)
It raises:
ValueError: 'index' must be the name of an existing column
It seems the Dask implementation is more limited and only accepts a single column name (a scalar) for 'index', not a list of columns (see dask.dataframe.reshape.pivot_table in the Dask documentation).
Is there another way to achieve this?