How to drop duplicates by string id for a large dataframe?

Hi,

I have a dataframe with more than 70 million rows and about 15 columns. I need to drop duplicate rows by the column "id", but Dask uses a lot of memory and gets very slow near the end; after a long wait, the worker crashed.

What are some good practices and ideas for using Dask to deduplicate data at this scale? Please help.
Thanks.

Hi @woestler, welcome to Dask community!

Do you have a lot of duplicates for a given ID in this Dataframe? What is the error message you get?

If the existing drop_duplicates doesn’t work, you could try to implement a custom version using shuffle as in this StackOverflow thread.

Hi,

Thanks for your reply. I have implemented a custom version of drop_duplicates and it works as expected, but when I use the P2P shuffle on the dataset I get an Arrow type error. Please check the code:

import dask.dataframe as dd
import pandas as pd

data = [
    {'companies': [], 'id': 'a', 'social': {'name': None}},
    {'companies': [{'id': 3},  {'id': 5}], 'id': 'b', 'social': {'name': None}},
    {'companies': [{'id': 3}, {'id': 4}, {'id': 5}], 'id': 'c', 'social': {'name': 'youtube'}},
    {'companies': [{'id': 9}], 'id': 'a', 'social': {'name': 'test'}}
]
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=2)

# extract the ids from each list of dicts
def extract_ids(companies):
    return [company.get('id') for company in companies]

ddf['companies'] = ddf['companies'].map_partitions(
    lambda s: s.apply(extract_ids), meta=('companies', 'object')
)

# shuffle on id, then drop duplicates within each partition
ddf = ddf.shuffle(on='id', shuffle="p2p", ignore_index=True)
ddf = ddf.map_partitions(lambda part: part.drop_duplicates(subset='id'))
ddf.compute()

However, if I set a string social name for the first row,

data = [
    {'companies': [], 'id': 'a', 'social': {'name': 'test'}},
    {'companies': [{'id': 3},  {'id': 5}], 'id': 'b', 'social': {'name': None}},
    {'companies': [{'id': 3}, {'id': 4}, {'id': 5}], 'id': 'c', 'social': {'name': 'youtube'}},
    {'companies': [{'id': 9}], 'id': 'a', 'social': {'name': 'test'}}
]

the code works. Could you please help me solve this issue?

Thanks

Hi @woestler,

Just tried your code with a LocalCluster, and it worked with the social name set to None. Which versions of Pandas/Distributed are you using? (I might be the one out of date.)

result:

  companies id          social
0    [3, 5]  b  {'name': None}
0        []  a  {'name': None}
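If the error persists on current versions, one possible workaround (an assumption on my part, not verified against your exact traceback) is that pyarrow cannot infer a consistent type for a struct column whose values are all None, so serializing the nested column to JSON strings before the P2P shuffle sidesteps the inference entirely:

```python
import json
import pandas as pd

data = [
    {'companies': [], 'id': 'a', 'social': {'name': None}},
    {'companies': [{'id': 3}, {'id': 5}], 'id': 'b', 'social': {'name': None}},
]
df = pd.DataFrame(data)

# serialize nested dicts to JSON strings so pyarrow sees a plain
# string column during the p2p shuffle instead of an all-null struct
df['social'] = df['social'].apply(json.dumps)

# ... build the Dask dataframe, shuffle, and deduplicate as before,
# then decode afterwards:
# ddf['social'] = ddf['social'].map_partitions(
#     lambda s: s.apply(json.loads), meta=('social', 'object'))
```

The round-trip through JSON costs some CPU, but it keeps every column a simple Arrow-friendly dtype for the duration of the shuffle.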