I have date-partitioned data in S3 that might contain duplicate entries. As far as I can determine, the duplicate data points occur extremely close to each other in time, so they ultimately end up in the same `.parquet` file.
As far as I can see, Dask maps each of these `.parquet` files to its own partition. That makes sense as a default: it's a reasonable approach without inspecting the data more deeply.
This allows me to do something like `ddf.map_partitions(lambda df: df.drop_duplicates(subset=['id']))`, since I know that all duplicates are located in the same partition as the original row. This assumption gets a bit shaky around some edge cases, but the basic premise seems to hold well.
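For concreteness, here is a minimal sketch of the per-partition approach (the bucket path and the `id` column are placeholders, not my real setup):

```python
import dask.dataframe as dd

# Read the date-partitioned parquet data from S3; by default each input
# file (or row group, depending on the Dask version) maps to one partition.
ddf = dd.read_parquet(
    "s3://my-bucket/events/date=2023-01-01/",  # placeholder path
)

# Drop duplicates independently within each partition. This is only correct
# because all copies of a given `id` land in the same .parquet file, and
# therefore in the same partition.
deduped = ddf.map_partitions(lambda df: df.drop_duplicates(subset=["id"]))

result = deduped.compute()
```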
Unfortunately, a `shuffle`, `set_index`, or similar computation would be prohibitively expensive for my data set: we need this to be relatively fast, and those methods explode the task graph into millions of tasks.
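For comparison, the globally correct version would look roughly like this, and it is exactly what I am trying to avoid:

```python
# Global deduplication: correct even if duplicates crossed partition
# boundaries, but it shuffles the data on `id`, which blows up the task
# graph for a data set of this size.
deduped_global = ddf.drop_duplicates(subset=["id"])

# Setting the index likewise repartitions/shuffles the whole data set:
# ddf = ddf.set_index("id")
```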
Bottom line, my question is: is this one-file-per-partition mapping the default partitioning behaviour? Can I set it explicitly somehow so that it stays consistent across upgrades? Can I manipulate this behaviour?
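Ideally I would like to pin the mapping down at read time, roughly along these lines. I am not sure these parameters actually guarantee one file per partition across Dask versions; `split_row_groups` and `aggregate_files` are just the knobs I have come across so far:

```python
ddf = dd.read_parquet(
    "s3://my-bucket/events/",   # placeholder path
    split_row_groups=False,     # keep each file as a single partition?
    aggregate_files=False,      # don't merge small files into one partition?
)
```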
So far I’ve found the documentation page on Dask’s internal design, which contains some information on partitions, but it’s not much.