I'm building a simple pivot table showing which position was owned by which fund on which date. The job reads one fairly large parquet file per day, extracting just 3 columns. The result for 1 year is ~100K rows by 365 columns, with each element containing either NaN or a single short string (the "fund").
This takes 46 seconds for one year of data if I fall back to pandas before creating the pivot table:
import dask.dataframe as dd

ddf = dd.read_parquet("s3://…/2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=["as_of_date"])
ddf.compute().pivot_table(index="position_id", columns="as_of_date", values="fund", aggfunc="first")
However, it seems like I'm pushing a lot of data around that way; I would have thought it would be more efficient, or at least as efficient, to have dask do the pivot table itself:
ddf = dd.read_parquet("s3://…/2021*.parquet", columns=["as_of_date", "position_id", "fund"]).categorize(columns=["as_of_date"])
ddf.pivot_table(index="position_id", columns="as_of_date", values="fund", aggfunc="first").compute()
But that takes 12 minutes, and if I give it more than 1 year of data, it dies with a KilledWorker exception.
Any guidance as to how to diagnose what's going on here? I'm not clear why it struggles so much with the pivot_table. FWIW, there are no duplicates, but pivot() doesn't exist in dask, so aggfunc="first" is the simplest placeholder I could think of to make it work.
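For reference, since (position_id, as_of_date) pairs are unique, the pandas fallback could equally use pivot() once the data is collected; this is just a sketch of the equivalent call, included to show why aggfunc="first" is only a placeholder:

# After collecting to pandas, pivot() works because there are no duplicate
# (position_id, as_of_date) pairs, so no aggregation is actually needed.
pdf = ddf.compute()
wide = pdf.pivot(index="position_id", columns="as_of_date", values="fund")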
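In case it helps, this is roughly how I could instrument the dask version to see where the time and memory go; the Client() call and the report filename below are placeholders, not my actual cluster setup:

# Minimal instrumentation sketch; Client() and the filename are placeholders.
import dask.dataframe as dd
from dask.distributed import Client, performance_report

client = Client()  # placeholder: in practice, connect to the real scheduler

ddf = dd.read_parquet(
    "s3://…/2021*.parquet",
    columns=["as_of_date", "position_id", "fund"],
).categorize(columns=["as_of_date"])

# Rough size check: partition count and approximate bytes per partition
print(ddf.npartitions)
print(ddf.map_partitions(lambda df: df.memory_usage(deep=True).sum()).compute())

# Capture a performance report for the slow pivot_table path
with performance_report(filename="pivot-report.html"):
    wide = ddf.pivot_table(
        index="position_id", columns="as_of_date", values="fund", aggfunc="first"
    ).compute()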