Say I have a dataframe with 5 million records.
Each record has a customer_ID and a date, plus 100s of other columns.
Together the customer_ID and date uniquely define a row.
What I need to be able to do is sort the rows belonging to a given customer_ID and assign a number to each row that indicates its order within that customer_ID group. The most recent row for that customer gets a 0, the second most recent gets a 1, etc.
In pandas you could do this by sorting on both columns, doing a groupby on customer_ID, and then doing a cumcount().
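Something along these lines (assuming the dataframe is called df and the key columns really are named customer_ID and date):

```python
import pandas as pd

# Sort so the most recent date comes first within each customer,
# then number the rows 0, 1, 2, ... within each customer_ID group.
df = df.sort_values(['customer_ID', 'date'], ascending=[True, False])
df['order'] = df.groupby('customer_ID').cumcount()  # 0 = most recent row
```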
I tried doing a sort_values on date and then a set_index on customer_ID, but setting the index on customer_ID jumbled the order that the sort_values on date had produced.
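Roughly what I tried (my dask dataframe is my_dd):

```python
# As far as I can tell, dask's set_index shuffles/sorts the data by the new
# index, which is why the earlier sort on date did not survive.
my_dd = my_dd.sort_values('date').set_index('customer_ID')
```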
Another option is:
```python
# A customer never has more than 13 dates.
my_dd = my_dd.groupby('customer_ID').apply(lambda x: x.nlargest(13, 'date'))
```
And this actually works okay in pandas: 9 seconds for 100,000 rows. When I tried the same thing in dask it seemed prohibitively slow; I stopped the process after 30 minutes.
The next thing I'm going to try is creating a rump dataframe with only the customer_ID and date columns, small enough to hold in memory, so I can do all the ranking work in pandas. Then I'd convert the result to a dask dataframe, join it to the full dataframe on customer_ID, and filter to the rows where the two date columns match.
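In code, the plan would look something like this (untested; assumes my_dd is the dask dataframe and 'order' is the new ranking column):

```python
import dask.dataframe as dd

# 1. Pull just the two key columns into memory as a pandas dataframe.
keys = my_dd[['customer_ID', 'date']].compute()

# 2. Rank within each customer in pandas (0 = most recent date).
keys = keys.sort_values(['customer_ID', 'date'], ascending=[True, False])
keys['order'] = keys.groupby('customer_ID').cumcount()

# 3. Convert back to dask, join on customer_ID, and keep only the rows
#    where the original date matches the date carried in by the ranking frame.
keys_dd = dd.from_pandas(keys, npartitions=my_dd.npartitions)
joined = my_dd.merge(keys_dd, on='customer_ID', how='left', suffixes=('', '_rank'))
my_dd = joined[joined['date'] == joined['date_rank']]
```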
Any other suggestions?
If I did a merge on both customer_ID and date, with customer_ID being the index, would dask use the fact that the index is participating in the merge to make it fast? Or would it be a slow merge because it also involves a non-index column? Since the merge is not just on the index, I would not be able to use the left_index and right_index parameters.
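For concreteness, these are the two variants I'm comparing (left_dd and right_dd are placeholder names):

```python
# Pure index-aligned merge: as I understand it, dask can do this without a
# full shuffle when both sides are indexed on customer_ID with known divisions.
merged = left_dd.merge(right_dd, left_index=True, right_index=True)

# Merge on the index column plus an ordinary column, which is what I'd
# actually need here; this is the case I'm unsure about.
merged = left_dd.reset_index().merge(right_dd.reset_index(),
                                     on=['customer_ID', 'date'])
```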