Hi Everyone!
Hope you are all doing great.
I need to join two parquets in an inner operation using the set index. One of the parquet contains significant less amount of rows and columns than the other, Dask is suffering and I suspect is because the number of partitions and size is very different between this two parquets. I would greatly appreciate if anyone has some advice or guidance on how can I approach this task. Thanks a lot in advance!
Hi @MasterSorcerer, welcome to Dask community!
Would the small Parquet dataset fit in memory? If so, you could try to experiment Large to Small Joins.
If the two dataset don’t fit into memory, then maybe you could do a Sorted join?