Pivot_table very slow

I'm building a simple pivot table of which position was owned by which fund on which date. It reads a fairly large parquet file per day, extracting just 3 columns. The result for 1 year is ~100K rows by 365 columns, each element containing NaN or a single short string (the "fund").

This takes 46 seconds for one year of data if I fall back to pandas before creating the pivot table:

ddf = dd.read_parquet("s3://…/2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=['as_of_date'])
ddf.compute().pivot_table(index='position_id', columns='as_of_date', values='fund', aggfunc='first')

However, that feels like I'm pushing a lot of data around; I would think it would be more efficient, or at least as efficient, to have dask do the pivot table:

ddf = dd.read_parquet("s3://…/2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=['as_of_date'])
ddf.pivot_table(index='position_id', columns='as_of_date', values='fund', aggfunc='first').compute()

But that takes 12 minutes, and if I give it more than 1 year of data, it dies with a KilledWorker exception.

Any guidance as to how to diagnose what's going on here? I'm not clear on why it struggles so much with the pivot_table. FWIW, there are no duplicates, but pivot() doesn't exist in dask, so aggfunc='first' is the simplest placeholder I can think of to make it work.
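For anyone else hitting this: a generic way to see where the time and memory go (just a standard diagnostic sketch, not specific to my data; the worker count and memory limit below are made-up values) is to run against an explicit distributed Client and keep the dashboard open while the pivot runs:

import dask.dataframe as dd
from dask.distributed import Client
# Start a local cluster with an explicit per-worker memory limit (sizes here are illustrative)
# so memory pressure shows up on the dashboard before a worker actually gets killed.
client = Client(n_workers=4, memory_limit="8GB")
print(client.dashboard_link)  # open in a browser; watch the task stream and worker memory panels
ddf = dd.read_parquet("s3://…/2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=["as_of_date"])
# Run the slow step with the dashboard open to see which tasks dominate and
# whether any single worker's memory climbs towards its limit.
ddf.pivot_table(index="position_id", columns="as_of_date", values="fund", aggfunc="first").compute()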

Solution here: "map_partitions of both pivot_table and pivot fail" · Issue #9191 · dask/dask · GitHub (https://github.com/dask/dask/issues/9191)
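The linked issue has the full discussion. As a rough sketch of the per-partition idea (assuming one daily file per partition, and not claiming this is exactly the code from the issue), each day can be pivoted in plain pandas and the small per-day frames concatenated on the client:

import dask
import dask.dataframe as dd
import pandas as pd
ddf = dd.read_parquet("s3://…/2021*.parquet", columns=["as_of_date", "position_id", "fund"])
# One delayed pandas pivot per partition; with one parquet file per day, each
# partition holds a single as_of_date, so each pivot is just a one-column frame.
parts = [
    part.pivot_table(index="position_id", columns="as_of_date", values="fund", aggfunc="first")
    for part in ddf.to_delayed()
]
# Compute the small per-day frames in parallel, then stitch the columns together
# on the client with plain pandas (indexes align on position_id).
wide = pd.concat(dask.compute(*parts), axis=1)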