I have a dataset like this where each row is player data:
>>> df.head()
 | game_size | match_id | party_size | player_assists | player_kills | player_name | team_id | team_placement | weights |
---|---|---|---|---|---|---|---|---|---|
0 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 1 | SnuffIes | 4 | 18 | 0 |
1 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 1 | Ozon3r | 4 | 18 | 0 |
2 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 0 | bovize | 5 | 33 | 0 |
3 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 0 | sbahn87 | 5 | 33 | 0 |
4 | 37 | 2U4GBNA0YmnNZYkzjkfgN4ev-hXSrak_BSey_YEG6kIuDG9fxFrrePqnqiM39pJO | 2 | 0 | 2 | GeminiZZZ | 14 | 11 | 0 |
Source: Full Dataset - Compressed 126MB, Decompressed 1.18GB
I need to create a new column called weights where each row is a number between 0 and 1. It should be calculated as the player's number of kills (player_kills) divided by the total number of kills for that player's team in the same match.
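To make the expected result concrete, here is the same calculation done by hand in plain Python on the five rows shown above. This is only a minimal sketch of the arithmetic, not the Dask solution I am looking for. The shortened match ID "m1" is a placeholder, and returning 0.0 for a team with no kills is my own assumption to avoid dividing by zero.

```python
from collections import defaultdict

# The five rows from df.head(); "m1" is a placeholder for the long match_id.
rows = [
    {"match_id": "m1", "team_id": 4, "player_name": "SnuffIes", "player_kills": 1},
    {"match_id": "m1", "team_id": 4, "player_name": "Ozon3r", "player_kills": 1},
    {"match_id": "m1", "team_id": 5, "player_name": "bovize", "player_kills": 0},
    {"match_id": "m1", "team_id": 5, "player_name": "sbahn87", "player_kills": 0},
    {"match_id": "m1", "team_id": 14, "player_name": "GeminiZZZ", "player_kills": 2},
]

# Step 1: total kills per (match_id, team_id) group.
team_kills = defaultdict(int)
for row in rows:
    team_kills[(row["match_id"], row["team_id"])] += row["player_kills"]

# Step 2: weight = player's kills / team's total kills.
# Assumption: a team with zero kills gets weight 0.0 for every member.
for row in rows:
    total = team_kills[(row["match_id"], row["team_id"])]
    row["weights"] = row["player_kills"] / total if total else 0.0

weights = {row["player_name"]: row["weights"] for row in rows}
print(weights)
```

With only these five rows, SnuffIes and Ozon3r each get 0.5, bovize and sbahn87 get 0.0, and GeminiZZZ gets 1.0 (the teammate is not among the rows shown, so the real value in the full dataset may differ).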
My Attempt
import dask.dataframe as dd
df = dd.read_parquet("pubg")
df['weights'] = 0
total_kills = df.groupby(['match_id', 'team_id'])['player_kills'].transform('sum')
df['total_kills'] = total_kills
print(df.compute())
This is as far as I got before running into this warning and error:
C:\Users\taven\PycharmProjects\openskill.py\benchmark\data\process.py:5: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .transform(func)
After: .transform(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .transform(func, meta=('x', 'f8')) for series result
total_kills = df.groupby(['match_id', 'team_id'])['player_kills'].transform('sum')
Traceback (most recent call last):
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\data\process.py", line 7, in <module>
print(df.compute())
^^^^^^^^^^^^
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\threaded.py", line 89, in get
results = get_async(
^^^^^^^^^^
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\local.py", line 511, in get_async
raise_exception(exc, tb)
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\local.py", line 319, in reraise
raise exc
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\dask\local.py", line 224, in execute_task
result = _execute_task(task, data)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\taven\PycharmProjects\openskill.py\benchmark\venv\3.11\Lib\site-packages\pandas\core\indexes\base.py", line 4275, in reindex
raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels
As you can see, I tried to create a total_kills column so that I could then divide each player’s kills by it. How do I solve this issue?