I have date-partitioned data in S3 that might contain duplicate entries. As far as I can determine, the duplicate data points occur extremely close to each other in time, so they ultimately end up in the same `.parquet` file.
As far as I can see, Dask maps each of these `.parquet` files to its own partition. That makes sense as a default: it's a reasonable approach without inspecting the data more deeply.
This allows me to do something like `ddf.map_partitions(lambda df: df.drop_duplicates(subset=['id']))`, since I know that all duplicates are located in the same partition as the original row. This assumption gets a bit shaky around some edge cases, but the basic premise seems to hold well.
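For concreteness, here is a minimal sketch of the per-partition approach (the bucket path and the `id` column are placeholders, not my real setup):

```python
import dask.dataframe as dd

# Read the date-partitioned parquet data from S3; by default each input
# file (or row group, depending on the Dask version) maps to one partition.
ddf = dd.read_parquet(
    "s3://my-bucket/events/date=2023-01-01/",  # placeholder path
)

# Drop duplicates independently within each partition. This is only correct
# because all copies of a given `id` land in the same .parquet file, and
# therefore in the same partition.
deduped = ddf.map_partitions(lambda df: df.drop_duplicates(subset=["id"]))

result = deduped.compute()
```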
Unfortunately, a `shuffle`, `set_index`, or similar computation would be prohibitively expensive for my data set: we need this to be relatively fast, and those methods explode the task graph into millions of tasks.
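For comparison, the globally correct version would look roughly like this, and it is exactly what I am trying to avoid:

```python
# Global deduplication: correct even if duplicates crossed partition
# boundaries, but it shuffles the data on `id`, which blows up the task
# graph for a data set of this size.
deduped_global = ddf.drop_duplicates(subset=["id"])

# Setting the index likewise repartitions/shuffles the whole data set:
# ddf = ddf.set_index("id")
```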
Bottom line, my question is: is this one-file-per-partition mapping the default partitioning behaviour? Can I set it explicitly somehow so that it stays consistent across upgrades? Can I manipulate this behaviour?
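Ideally I would like to pin the mapping down at read time, roughly along these lines. I am not sure these parameters actually guarantee one file per partition across Dask versions; `split_row_groups` and `aggregate_files` are just the knobs I have come across so far:

```python
ddf = dd.read_parquet(
    "s3://my-bucket/events/",   # placeholder path
    split_row_groups=False,     # keep each file as a single partition?
    aggregate_files=False,      # don't merge small files into one partition?
)
```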
So far I’ve found the documentation page on Dask’s internal design, which contains some information on partitions, but it’s not much.