I'm building a simple pivot table showing which position was owned by which fund on which date. The job reads one fairly large parquet file per day, extracting just 3 columns. The result for 1 year is ~100K rows by 365 columns, with each element containing either NaN or a single short string (the "fund").
This takes 46 seconds for one year of data if I fall back to pandas before creating the pivot table:
import dask.dataframe as dd

ddf = dd.read_parquet("s3://…/2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=["as_of_date"])
ddf.compute().pivot_table(index="position_id", columns="as_of_date", values="fund", aggfunc="first")
However, it seems like I'm pushing a lot of data around that way; I would have thought it would be more efficient, or at least as efficient, to have dask do the pivot table itself:
ddf = dd.read_parquet("s3://…/2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=["as_of_date"])
ddf.pivot_table(index="position_id", columns="as_of_date", values="fund", aggfunc="first").compute()
But that takes 12 minutes, and if I give it more than 1 year of data, it dies with a KilledWorker exception.
Any guidance as to how to diagnose what's going on here? I'm not clear why it struggles so much with the pivot_table. FWIW, there are no duplicates, but pivot() doesn't exist in dask, so aggfunc="first" is the simplest placeholder I could think of to make it work.
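For reference, since (position_id, as_of_date) pairs are unique, the pandas fallback could equally use pivot() once the data is collected; this is just a sketch of the equivalent call, included to show why aggfunc="first" is only a placeholder:

# After collecting to pandas, pivot() works because there are no duplicate
# (position_id, as_of_date) pairs, so no aggregation is actually needed.
pdf = ddf.compute()
wide = pdf.pivot(index="position_id", columns="as_of_date", values="fund")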
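In case it helps, this is roughly how I could instrument the dask version to see where the time and memory go; the Client() call and the report filename below are placeholders, not my actual cluster setup:

# Minimal instrumentation sketch; Client() and the filename are placeholders.
import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client()  # placeholder: in practice, connect to the real scheduler

ddf = dd.read_parquet(
    "s3://…/2021*.parquet",
    columns=["as_of_date", "position_id", "fund"],
).categorize(columns=["as_of_date"])

# Rough size check: partition count and approximate bytes per partition
print(ddf.npartitions)
print(ddf.map_partitions(lambda df: df.memory_usage(deep=True).sum()).compute())

# Capture a performance report for the slow pivot_table path
with performance_report(filename="pivot-report.html"):
    wide = ddf.pivot_table(
        index="position_id", columns="as_of_date", values="fund", aggfunc="first"
    ).compute()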