How to efficiently left merge two large Dask dataframes without matching on index and while retaining partitioning in left dataframe?

I have the following two dataframes:

df_1 is a Dask dataframe containing a col_to_merge. It consists of many partitions but does not have known divisions nor an index. It is possible for me to create an arbitrary index if necessary. I do want the ordering and partitioning of this dataframe to be retained.

df_2 is another very large Dask dataframe. This dataframe is indexed on col_to_merge and has known divisions.

Now I want to do the following operation:

df_1 = dd.merge(left=df_1,
                right=df_2,
                how='left',
                left_on='col_to_merge'
                right_index=True)

Simply calling this merge works, however the result of df_1 gets a completely different partitioning (trying to match that of df_2), which is not the goal.

I have currently solved this by doing:

  1. Create an arbitrary index and divisions on df_1
  2. Perform the merge
  3. Restore the order of df_1 by calling set_index with this arbitrary index and divisions

This produces the expected result but feels a bit hacky and is very expensive to run.

Do you have a good suggestion on how to tackle this case?

Thank you in advance for your considerations!

Hi @CynthiaL, welcome to Dask community!

Well, as stated in dask_expr._collection.DataFrame.merge — Dask documentation

Joining one on index and one on column. In this case the divisions of dataframe merged by index (di) are used to divide the column merged dataframe (dc) one using dask.dataframe.multi.rearrange_by_divisions. In this case the merged dataframe (dm) has the exact same divisions as (di).

Which is exlained by the internal mechanism of Dask and how to perform a distributed merge. I’m not sure how you could keep the partitions of an unindexed dataframe upon merge. Your solution sounds like a good one, but yeah this involve several shuffles…

cc @fjetter @crusaderky