How do I calculate ratios from groupby transforms in dask?

I have a dataset like this where each row is player data:

>>> df.head()
game_size match_id party_size player_assists player_kills player_name team_id team_placement weights
0 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 1 SnuffIes 4 18 0
1 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 1 Ozon3r 4 18 0
2 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 0 bovize 5 33 0
3 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 0 sbahn87 5 33 0
4 37 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO 2 0 2 GeminiZZZ 14 11 0

Source: Full Dataset - Compressed 126MB, Decompressed 1.18GB

I need to create a new column called weights where each row holds a number between 0 and 1. It should be calculated as the player's kills (player_kills) divided by the total kills of that player's team in the same match.
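To make the target calculation concrete, here is a minimal pandas sketch on a toy frame modeled on the sample rows. The shortened match_id "m1" and the choice to fill the 0/0 case (a team with no kills) with 0.0 are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy rows modeled on the sample above; match_id is shortened (assumption).
toy = pd.DataFrame({
    "match_id": ["m1"] * 5,
    "team_id": [4, 4, 5, 5, 14],
    "player_name": ["SnuffIes", "Ozon3r", "bovize", "sbahn87", "GeminiZZZ"],
    "player_kills": [1, 1, 0, 0, 2],
})

# Total kills per team within each match, broadcast back to every player row.
team_kills = toy.groupby(["match_id", "team_id"])["player_kills"].transform("sum")

# A team with zero kills gives 0/0 = NaN; filling with 0.0 is an assumption.
toy["weights"] = (toy["player_kills"] / team_kills).fillna(0.0)

print(toy[["player_name", "team_id", "player_kills", "weights"]])
```

In plain pandas this works because the transformed Series keeps the original row index, so the division lines up row by row.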

My Attempt

import dask.dataframe as dd

df = dd.read_parquet("pubg")
df['weights'] = 0
total_kills = df.groupby(['match_id', 'team_id'])['player_kills'].transform('sum')
df['total_kills'] = total_kills
print(df.compute())

This is as far as I got before running into this warning and error:

C:\Users\taven\PycharmProjects\openskill.py\benchmark\data\process.py:5: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .transform(func)
  After:  .transform(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .transform(func, meta=('x', 'f8'))            for series result
  total_kills = df.groupby(['match_id', 'team_id'])['player_kills'].transform('sum')
Traceback (most recent call last):
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\data\process.py", line 7, in <module>
    print(df.compute())
          ^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\threaded.py", line 89, in get
    results = get_async(
              ^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\local.py", line 511, in get_async
    raise_exception(exc, tb)
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\local.py", line 319, in reraise
    raise exc
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\local.py", line 224, in execute_task
    result = _execute_task(task, data)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\pandas\core\indexes\base.py", line 4275, in reindex
    raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels

As you can see, I tried to create a total_kills column so that I could then divide each player's kills by it. How do I solve this issue?

Answered here: python - How to get reassign column values from groupby.aggregrate back to original dataframe in dask? - Stack Overflow
