Using pivot_table with non-numerical data

What would be the best way to do a pivot_table on a dataframe containing non-numerical data? I can do it with pandas for smaller dataframes, but my current data is too large to fit in memory.


raw_data_pd = pd.DataFrame({"ID": [1, 1], "Col_ID": ["A", "B"], "value": ["Alice", "Bob"]})
pivoted_data_pd = raw_data_pd.pivot_table(index="ID", columns="Col_ID", values="value", aggfunc="first")

This would produce the desired output

Col_ID      A    B
1       Alice  Bob

Trying the same with dask does not work as pivot_table in dask.dataframe does not support aggfunc=“first” but needs “mean”, “sum” or “count” and thus numerical data.

raw_data_dd = dd.from_pandas(raw_data_pd, npartitions=1)
pivoted_data_dd = raw_data_dd.categorize(columns=["Col_ID"]).pivot_table(index="ID", columns="Col_ID", values="value")

Is there any way I can circumvent this problem, except for chunking my original file and use pandas on the chunks?

Hi @nordange and welcome! You can get close with something like this, if you don’t need a multiindex:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"ID": [1, 1, 1, 1], "Col_ID": ["A", "A", "B", "B"], "value": ["Alice", "Bob", "Susan", "Kim"]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.groupby(['Col_ID', 'ID']).first().compute()

However, there is still work being done to fully support a multi-index in Dask DataFrame. You might also find these docs on various aggregation strategies helpful.


Thank you very much, @scharlottej13.

Yes, that looks similar to what I need, except for the multiindex.

Also, I forgot to mention: For this use case I know that I do not have multiple values for each ID and Col_ID, so actually I do not need any aggregation as such, but pandas’ pivot_table has the very convenient aggunc=“first” which works for non-numeric data.

@nordange no problem! I agree that Pandas’ pivot_table makes it much more convenient to work with non-numeric data. I would encourage you to open up an issue for a feature request for Dask’s pivot_table to support first and last!

Great suggestion! Thank you for your support.