Hi @devLeitner, welcome to Dask discourse!
Could you elaborate on that, or provide an example dataset? I'm not sure I understand, but if you have a lot of rows with the same merging id, you might run into this problem:
> In some cases, you may see a `MemoryError` if the `merge` operation requires an internal `shuffle`, because shuffling places all rows that have the same index in the same partition. To avoid this error, make sure all rows with the same `on`-column value can fit on a single partition.
See the discussion here too: Memory Leakage on single worker on merged DataFrame (after task completion).