How to efficiently left merge two large Dask dataframes without matching on index and while retaining partitioning in left dataframe?

CynthiaL · June 17, 2024, 11:30am

I have the following two dataframes:

df_1 is a Dask dataframe containing a col_to_merge. It consists of many partitions but does not have known divisions nor an index. It is possible for me to create an arbitrary index if necessary. I do want the ordering and partitioning of this dataframe to be retained.

df_2 is another very large Dask dataframe. This dataframe is indexed on col_to_merge and has known divisions.

Now I want to do the following operation:

df_1 = dd.merge(left=df_1,
                right=df_2,
                how='left',
                left_on='col_to_merge'
                right_index=True)

Simply calling this merge works, however the result of df_1 gets a completely different partitioning (trying to match that of df_2), which is not the goal.

I have currently solved this by doing:

Create an arbitrary index and divisions on df_1
Perform the merge
Restore the order of df_1 by calling set_index with this arbitrary index and divisions

This produces the expected result but feels a bit hacky and is very expensive to run.

Do you have a good suggestion on how to tackle this case?

Thank you in advance for your considerations!

guillaumeeb · June 19, 2024, 8:27am

Hi @CynthiaL, welcome to Dask community!

Well, as stated in dask_expr._collection.DataFrame.merge — Dask documentation

Joining one on index and one on column. In this case the divisions of dataframe merged by index (di) are used to divide the column merged dataframe (dc) one using dask.dataframe.multi.rearrange_by_divisions. In this case the merged dataframe (dm) has the exact same divisions as (di).

Which is exlained by the internal mechanism of Dask and how to perform a distributed merge. I’m not sure how you could keep the partitions of an unindexed dataframe upon merge. Your solution sounds like a good one, but yeah this involve several shuffles…

cc @fjetter @crusaderky

Topic		Replies	Views
Dataframe merge with the partitioned dataframe Dask DataFrame merge	3	165	April 3, 2024
List of Dask Dataframe operations that could be run in parallel without using map_partitions Dask DataFrame	4	29	December 6, 2024
Maintaining index between .values and .to_dask_dataframe Dask DataFrame	3	126	February 23, 2024
Dataframe merges Dask DataFrame shuffling	1	334	December 1, 2021
Dataframe indexes Dask DataFrame	3	838	June 16, 2022

How to efficiently left merge two large Dask dataframes without matching on index and while retaining partitioning in left dataframe?

Related topics