Hi,
Thanks for your reply. I have implemented a custom version of drop_duplicates and it works as expected, but when I use P2P shuffle on the dataset I get an Arrow type error. Please see the code below.
import dask.dataframe as dd
import pandas as pd
data = [
{'companies': [], 'id': 'a', 'social': {'name': None}},
{'companies': [{'id': 3}, {'id': 5}], 'id': 'b', 'social': {'name': None}},
{'companies': [{'id': 3}, {'id': 4}, {'id': 5}], 'id': 'c', 'social': {'name': 'youtube'}},
{'companies': [{'id': 9}], 'id': 'a', 'social': {'name': 'test'}}
]
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=2)
# extract the id from each company dict
def extract_ids(companies):
    return [company['id'] for company in companies]
ddf['companies'] = ddf['companies'].map_partitions(lambda x: x.apply(lambda y: [c.get('id') for c in y]), meta=('companies', 'object'))
# shuffle and drop duplicates
ddf = ddf.shuffle(on='id', shuffle="p2p", ignore_index=True)
ddf = ddf.map_partitions(lambda parts: parts.drop_duplicates(subset='id'))
ddf.compute()
If I set a string social name for the first row:
data = [
{'companies': [], 'id': 'a', 'social': {'name': 'test'}},
{'companies': [{'id': 3}, {'id': 5}], 'id': 'b', 'social': {'name': None}},
{'companies': [{'id': 3}, {'id': 4}, {'id': 5}], 'id': 'c', 'social': {'name': 'youtube'}},
{'companies': [{'id': 9}], 'id': 'a', 'social': {'name': 'test'}}
]
the code works. Could you please help me solve this issue?
Thanks