How to efficiently merge two parquets that are very dissimilar in size and partitions number

MasterSorcerer · October 7, 2023, 7:16pm

Hi Everyone!
Hope you are all doing great.
I need to inner join two Parquet datasets on their index. One of them has significantly fewer rows and columns than the other. Dask is struggling, and I suspect it is because the number of partitions and the size differ so much between the two datasets. I would greatly appreciate any advice or guidance on how to approach this task. Thanks a lot in advance!

guillaumeeb · October 9, 2023, 1:31pm

Hi @MasterSorcerer, welcome to the Dask community!

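Would the small Parquet dataset fit in memory? If so, you could try a large-to-small join: load the small dataset as a single pandas DataFrame and merge it into the large Dask DataFrame, which avoids shuffling the large one.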

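If neither dataset fits into memory, then maybe you could do a sorted join? When both DataFrames are sorted on the index and Dask knows their divisions, a join on the index is much cheaper than a general merge.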
