Filter df based on indices of other dfs

deklanw · August 23, 2023, 3:10pm

I have four dfs, all backed by reasonably partitioned Parquet files.

The four dfs are role_df, school_df, company_df, and profile_df.

What I’m trying to do, in pandas:

org_ids = company_df.index.union(school_df.index)
profile_ids = profile_df.index

filtered_role_df = role_df[role_df.org_id.isin(org_ids) & role_df.profile_id.isin(profile_ids)]

# now persist filtered_role_df

In English: I want to combine the indices of school_df and company_df into one set-like object, org_ids, and take the index of profile_df as another set-like object, profile_ids. Then I want to filter role_df so that it keeps only the rows whose org_id appears in org_ids and whose profile_id appears in profile_ids.
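A toy illustration of the intended result in plain pandas. All IDs and the extra columns (name, title) are made up for the example and stand in for the real data:

```python
import pandas as pd

# Made-up stand-ins for the real frames; only the index values matter here.
company_df = pd.DataFrame({"name": ["Acme", "Globex"]}, index=pd.Index([1, 2], name="id"))
school_df = pd.DataFrame({"name": ["Globex U", "State U"]}, index=pd.Index([2, 3], name="id"))
profile_df = pd.DataFrame({"name": ["Ann", "Bob"]}, index=pd.Index([10, 11], name="id"))

role_df = pd.DataFrame(
    {
        "org_id": [1, 3, 9, 2],
        "profile_id": [10, 11, 10, 99],
        "title": ["engineer", "student", "analyst", "manager"],
    }
)

org_ids = company_df.index.union(school_df.index)  # union of both indices
profile_ids = profile_df.index

# Keep a row only if BOTH of its IDs are known.
filtered_role_df = role_df[
    role_df.org_id.isin(org_ids) & role_df.profile_id.isin(profile_ids)
]
print(filtered_role_df)
```

Here the row with org_id 9 is dropped because no company or school has that ID, and the row with profile_id 99 is dropped because no profile has that ID.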

role_df (and the filtered version) is too large to fit into memory, so I’m hoping that, behind the scenes, the filtering can be applied to each partition separately and the result saved to Parquet again.

I’ve hit different issues trying to do this:

- There is no .union method implemented on Dask indices.
- When I materialize the indices with .compute and then convert them into sets, saving to Parquet stalls and I get many warnings like WARNING - full garbage collections took 29% CPU time recently (threshold: 10%).

What I found to work: materialize the indices, don’t convert to sets, union them, and then proceed

profile_ids = profile_df.index.compute()
company_ids = company_df.index.compute()
school_ids = school_df.index.compute()

org_ids = company_ids.union(school_ids)

filtered_role_df = role_df[role_df.org_id.isin(org_ids) & role_df.profile_id.isin(profile_ids)]

dd.to_parquet(filtered_role_df, ...)

This works, but I still get a warning:

UserWarning: Sending large graph of size 479.65 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(

It seems unavoidable that each thread should receive both org_ids and profile_ids, right? The warning suggests scattering. If I do so, I get TypeError: only list-like objects are allowed to be passed to isin(), you passed a [Future] on the line that starts with filtered_role_df =

Why does passing sets not work?

Am I doing anything blatantly wrong in my working solution? Is there any way to improve this?

NOTE:
I’m running this locally, like so:

cluster = LocalCluster(n_workers=4, processes=True, threads_per_worker=4)
client = Client(cluster)

If I wanted to scale this I’m not sure if it’s better to, e.g., have 8 workers and 4 threads/worker, or 4 workers and 8 threads/worker.

guillaumeeb · August 24, 2023, 6:37pm

It would help to have a reproducer to play with, but anyway, if I had to implement this in a distributed way, I think I would use an outer merge to build org_ids as a Dask Series, and then two inner joins, on org_id and profile_id, to keep the filtered values. I think this method would avoid materializing a huge set of IDs and would work in a fully distributed way.
