How to drop duplicates by string id for a large dataframe?

Hi,

I have a dataframe with more than 70 million rows and about 15 columns. I need to drop duplicate rows by the column "id", but Dask uses a lot of memory and gets very slow near the end; after a long wait, the worker crashed.

What are some good practices and ideas for using Dask to deduplicate data at this scale? Please help.
Thanks.

Hi @woestler, welcome to Dask community!

Do you have a lot of duplicates for a given ID in this Dataframe? What is the error message you get?

If the existing drop_duplicates doesn’t work, you could try to implement a custom version using shuffle as in this StackOverflow thread.

Hi,

Thanks for your reply. I have implemented a custom version of drop_duplicates and it works as expected, but when I use the P2P shuffle on the dataset I get an Arrow type error. Please check the code:

import dask.dataframe as dd
import pandas as pd

data = [
    {'companies': [], 'id': 'a', 'social': {'name': None}},
    {'companies': [{'id': 3},  {'id': 5}], 'id': 'b', 'social': {'name': None}},
    {'companies': [{'id': 3}, {'id': 4}, {'id': 5}], 'id': 'c', 'social': {'name': 'youtube'}},
    {'companies': [{'id': 9}], 'id': 'a', 'social': {'name': 'test'}}
]
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=2)

# extract the ids from each list of dicts
def extract_ids(companies):
    return [company.get('id') for company in companies]

ddf['companies'] = ddf['companies'].map_partitions(
    lambda s: s.apply(extract_ids), meta=('companies', 'object')
)

# shuffle on id, then drop duplicates within each partition
ddf = ddf.shuffle(on='id', shuffle="p2p", ignore_index=True)
ddf = ddf.map_partitions(lambda part: part.drop_duplicates(subset='id'))
ddf.compute()

However, if I set a string social name for the first row,

data = [
    {'companies': [], 'id': 'a', 'social': {'name': 'test'}},
    {'companies': [{'id': 3},  {'id': 5}], 'id': 'b', 'social': {'name': None}},
    {'companies': [{'id': 3}, {'id': 4}, {'id': 5}], 'id': 'c', 'social': {'name': 'youtube'}},
    {'companies': [{'id': 9}], 'id': 'a', 'social': {'name': 'test'}}
]

the code works. Could you please help me solve this issue?

Thanks

Hi @woestler,

Just tried your code with a LocalCluster, and it worked with the social name set to None. Which versions of Pandas/Distributed are you using? (I might be the one out of date.)

result:

  companies id          social
0    [3, 5]  b  {'name': None}
0        []  a  {'name': None}
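If the error persists on current versions, one possible workaround (an assumption on my part, not verified against your exact traceback) is that pyarrow cannot infer a consistent type for a struct column whose values are all None, so serializing the nested column to JSON strings before the P2P shuffle sidesteps the inference entirely:

```python
import json
import pandas as pd

data = [
    {'companies': [], 'id': 'a', 'social': {'name': None}},
    {'companies': [{'id': 3}, {'id': 5}], 'id': 'b', 'social': {'name': None}},
]
df = pd.DataFrame(data)

# serialize nested dicts to JSON strings so pyarrow sees a plain
# string column during the p2p shuffle instead of an all-null struct
df['social'] = df['social'].apply(json.dumps)

# ... build the Dask dataframe, shuffle, and deduplicate as before,
# then decode afterwards:
# ddf['social'] = ddf['social'].map_partitions(
#     lambda s: s.apply(json.loads), meta=('social', 'object'))
```

The round-trip through JSON costs some CPU, but it keeps every column a simple Arrow-friendly dtype for the duration of the shuffle.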