How does Dask determine partitions?

I have date-partitioned S3 data that might contain duplicate entries. As far as I can tell, these duplicates occur extremely close to each other in time and ultimately end up in the same .parquet file.

As far as I can see, Dask maps each of these .parquet files to its own partition. That makes sense, since it's a reasonable default without digging deeper into the data.

This allows me to do something like ddf.map_partitions(lambda df: df.drop_duplicates(subset=['id'])), since I know that all duplicates are located in the same partition as the original. This assumption gets a bit shaky in some edge cases, but the basic premise seems to hold.
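For concreteness, here's a minimal sketch of what I mean (the bucket path and the 'id' column are placeholders for my actual data):

```python
import dask.dataframe as dd

# Each .parquet file ends up as its own partition (see the parameter discussed
# below), so duplicates that landed in the same file can be dropped locally,
# without any shuffle.
ddf = dd.read_parquet("s3://my-bucket/events/date=2023-01-01/")  # placeholder path

deduped = ddf.map_partitions(lambda df: df.drop_duplicates(subset=["id"]))
```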

Unfortunately, a shuffle, set_index, or similar operation would be prohibitively expensive for my dataset: we need it to be relatively fast, and those methods explode the task graph into millions of tasks.

Bottom line, my question is: is this the default partitioning behaviour? Can I set it explicitly somehow so that it stays consistent across upgrades? Can I manipulate this behaviour?

So far I’ve found the documentation page on Dask’s internal design, which contains some information on partitions, but it’s not much.

It seems I’ve found what I was looking for in the Dask Dataframe and Parquet section of the docs.

The corresponding parameter to explicitly set this behaviour is split_row_groups in dd.read_parquet.
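For anyone landing here later, a minimal sketch of how I'm using it (the path is a placeholder; as I understand it, split_row_groups=False keeps each .parquet file as a single partition rather than splitting it by row group):

```python
import dask.dataframe as dd

# split_row_groups=False: do not split files into per-row-group partitions,
# so each .parquet file maps to exactly one Dask partition.
ddf = dd.read_parquet(
    "s3://my-bucket/events/",  # placeholder path
    split_row_groups=False,
)

# Duplicates share a file, hence a partition, so a per-partition dedup suffices.
deduped = ddf.map_partitions(lambda df: df.drop_duplicates(subset=["id"]))
```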


Hi @filpano,

Thanks for sharing your problem and solution, this is really appreciated!