What would be the best way to do a pivot_table on a dataframe containing non-numerical data? I can do it with pandas for smaller dataframes, but my current data is too large to fit in memory.
Example:
raw_data_pd = pd.DataFrame({"ID": [1, 1], "Col_ID": ["A", "B"], "value": ["Alice", "Bob"]})
pivoted_data_pd = raw_data_pd.pivot_table(index="ID", columns="Col_ID", values="value", aggfunc="first")
This would produce the desired output
Col_ID A B
1 Alice Bob
Trying the same with dask does not work as pivot_table in dask.dataframe does not support aggfunc=“first” but needs “mean”, “sum” or “count” and thus numerical data.
raw_data_dd = dd.from_pandas(raw_data_pd, npartitions=1)
pivoted_data_dd = raw_data_dd.categorize(columns=["Col_ID"]).pivot_table(index="ID", columns="Col_ID", values="value")
Is there any way I can circumvent this problem, except for chunking my original file and use pandas on the chunks?
Hi @nordange and welcome! You can get close with something like this, if you don’t need a multiindex:
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({"ID": [1, 1, 1, 1], "Col_ID": ["A", "A", "B", "B"], "value": ["Alice", "Bob", "Susan", "Kim"]})
ddf = dd.from_pandas(df, npartitions=2)
ddf.groupby(['Col_ID', 'ID']).first().compute()
However, there is still work being done to fully support a multi-index in Dask DataFrame. You might also find these docs on various aggregation strategies helpful.
2 Likes
Thank you very much, @scharlottej13.
Yes, that looks similar to what I need, except for the multiindex.
Also, I forgot to mention: For this use case I know that I do not have multiple values for each ID and Col_ID, so actually I do not need any aggregation as such, but pandas’ pivot_table has the very convenient aggunc=“first” which works for non-numeric data.
@nordange no problem! I agree that Pandas’ pivot_table
makes it much more convenient to work with non-numeric data. I would encourage you to open up an issue for a feature request for Dask’s pivot_table
to support first
and last
!
Great suggestion! Thank you for your support.
2 Likes