Say I have a dataframe with 5 million records.
Each record has a customer_ID and a date, plus 100s of other columns.
Together the customer_ID and date uniquely define a row.
What I need to be able to do is sort the rows belonging to a given customer_ID and assign a number to each row that indicates its order within that customer_ID group. The most recent row for that customer gets a 0, the second most recent gets a 1, etc.
In pandas you could do this by sorting on both columns, doing a groupby on customer_ID, and then doing a cumcount().
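Something along these lines (assuming the dataframe is called df and the key columns really are named customer_ID and date):

```python
import pandas as pd

# Sort so the most recent date comes first within each customer,
# then number the rows 0, 1, 2, ... within each customer_ID group.
df = df.sort_values(['customer_ID', 'date'], ascending=[True, False])
df['order'] = df.groupby('customer_ID').cumcount()  # 0 = most recent row
```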
I tried doing a sort_values on date and then a set_index on customer_ID, but setting the index on customer_ID jumbled the order that the sort_values on date had produced.
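Roughly what I tried (my dask dataframe is my_dd):

```python
# As far as I can tell, dask's set_index shuffles/sorts the data by the new
# index, which is why the earlier sort on date did not survive.
my_dd = my_dd.sort_values('date').set_index('customer_ID')
```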
Another option is:
```python
# A customer never has more than 13 dates.
my_dd = my_dd.groupby('customer_ID').apply(lambda x: x.nlargest(13, 'date'))
```
And this actually works okay in pandas: 9 seconds for 100,000 rows. When I tried the same thing in dask it seemed prohibitively slow; I stopped the process after 30 minutes.
The next thing I'm going to try is creating a rump dataframe with only the customer_ID and date columns, small enough to hold in memory, so I can do all the ranking work in pandas. Then I'd convert the result to a dask dataframe, join it to the full dataframe on customer_ID, and filter to the rows where the two date columns match.
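In code, the plan would look something like this (untested; assumes my_dd is the dask dataframe and 'order' is the new ranking column):

```python
import dask.dataframe as dd

# 1. Pull just the two key columns into memory as a pandas dataframe.
keys = my_dd[['customer_ID', 'date']].compute()

# 2. Rank within each customer in pandas (0 = most recent date).
keys = keys.sort_values(['customer_ID', 'date'], ascending=[True, False])
keys['order'] = keys.groupby('customer_ID').cumcount()

# 3. Convert back to dask, join on customer_ID, and keep only the rows
#    where the original date matches the date carried in by the ranking frame.
keys_dd = dd.from_pandas(keys, npartitions=my_dd.npartitions)
joined = my_dd.merge(keys_dd, on='customer_ID', how='left', suffixes=('', '_rank'))
my_dd = joined[joined['date'] == joined['date_rank']]
```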
Any other suggestions?
If I did a merge on both customer_ID and date, with customer_ID being the index, would dask use the fact that the index is participating in the merge to make it fast? Or would it be a slow merge because it also involves a non-index column? Since the merge is not just on the index, I would not be able to use the left_index and right_index parameters.
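For concreteness, these are the two variants I'm comparing (left_dd and right_dd are placeholder names):

```python
# Pure index-aligned merge: as I understand it, dask can do this without a
# full shuffle when both sides are indexed on customer_ID with known divisions.
merged = left_dd.merge(right_dd, left_index=True, right_index=True)

# Merge on the index column plus an ordinary column, which is what I'd
# actually need here; this is the case I'm unsure about.
merged = left_dd.reset_index().merge(right_dd.reset_index(),
                                     on=['customer_ID', 'date'])
```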