I have the following two dataframes:
df_1 is a Dask dataframe containing a column col_to_merge. It consists of many partitions but has neither known divisions nor an index. I can create an arbitrary index if necessary. I want the ordering and partitioning of this dataframe to be retained.
df_2 is another very large Dask dataframe. It is indexed on col_to_merge and has known divisions.
Now I want to do the following operation:
import dask.dataframe as dd

df_1 = dd.merge(left=df_1,
                right=df_2,
                how='left',
                left_on='col_to_merge',
                right_index=True)
Simply calling this merge works, but the resulting df_1 gets a completely different partitioning (it tries to match that of df_2), which is not what I want.
I have currently solved this by doing:
- Create an arbitrary index and divisions on df_1
- Perform the merge
- Restore the order of df_1 by calling set_index with this arbitrary index and divisions
This produces the expected result but feels a bit hacky and is very expensive to run.
Do you have a good suggestion on how to tackle this case?
Thank you in advance for your considerations!