How to sort within groups?

Say I have a dataframe with 5 million records.
Each record has a customer_ID and a date, plus 100s of other columns.
Together the customer_ID and date uniquely define a row.

What I need to be able to do is sort the rows belonging to a given customer_ID and assign a number to each row indicating its order within the customer_ID group. The most recent row for that customer gets a 0, the second most recent gets a 1, and so on.

In pandas you could do this by sorting on both columns, then doing a groupby on customer_ID, then doing a cumcount().
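For reference, a minimal sketch of that pandas approach on toy data (the column names come from the post; the real frame has 100s of other columns):

```python
import pandas as pd

# Toy frame standing in for the real data; customer_ID + date
# uniquely identify a row.
df = pd.DataFrame({
    'customer_ID': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime([
        '2021-01-01', '2021-03-01', '2021-02-01',
        '2021-01-15', '2021-02-15',
    ]),
})

# Sort so the most recent date comes first within each customer,
# then number the rows 0, 1, 2, ... within each group.
df = df.sort_values(['customer_ID', 'date'], ascending=[True, False])
df['order'] = df.groupby('customer_ID').cumcount()
```

With this, each customer's most recent row gets `order == 0`, the next most recent gets 1, etc.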

I tried doing a sort_values on date and then a set_index on customer_ID, but setting the index jumbled the order established by the sort_values on date.

Another option is:

# A customer never has more than 13 dates.
my_dd = my_dd.groupby('customer_ID').apply(lambda x: x.nlargest(13, 'date'))

And this actually works okay in pandas: 9 seconds to do 100,000 rows. When I tried using dask it seemed prohibitively slow. I stopped the process after 30 minutes.

The next thing I’m going to try is creating a rump dataframe with only the date and customer_ID, small enough to hold in memory, so I can do all the work in pandas. I’d then convert it to a dask dataframe, do a join on customer_ID, and filter to the rows where the two pay_dates match.

Any other suggestions?
If I did a merge on both customer_ID and date, with customer_ID being the index, would dask use the fact that the indexes were participating in the merge to make it fast? Or would it be a slow merge because it includes out-of-index columns? Since the merge is not just on the index, I would not be able to use the right_index and left_index parameters.

@DRudel The groupby operation creates ~400k very small tasks (inferred from: 3 million rows, and each customer has at most 13 unique dates) – which is causing a bottleneck. @ncclementi and I think Dask is having trouble scheduling that many tasks, and hence, it’s stalling.

We’re not sure of a better approach to do this particular operation though.

A few questions:

What I need to be able to do is sort the rows belonging to a given customer_ID and assign a number to each row indicating its order within the customer_ID group. The most recent row for that customer gets a 0, the second most recent gets a 1, and so on.

  • Could you please share details around why you’re looking to do the above? And, what is the computation after this? – I’m asking because maybe there are alternative options or we can try to optimize the complete workflow as a whole.

@ian Do you have thoughts on this?