Best way to partition a dataframe respecting boundaries of row subgroups

u_ser · April 18, 2022, 7:38pm

Hello. I have a (large) dataset composed of multiple subgroups of rows (each subgroup can have a different number of rows). The dataset is read in parallel without respecting these subgroup boundaries. Is there a way to repartition this dataframe so that each Dask dataframe partition contains only complete subgroups? For instance, each partition holding only 1000 complete subgroups?

The only way that I know to do that is to use the repartition function and to manually specify the partition points.

Thanks

pavithraes · April 28, 2022, 12:31pm

@u_ser Hi, and welcome to Discourse! Would you be able to share a minimal, reproducible example? It’ll allow us to help you better!

"The only way that I know to do that is to use the repartition function and to manually specify the partition points."

From your description, this is the only way I can think of too. But, I’d be happy to test out a few things based on your minimal example.
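
For reference, here is a minimal sketch of what "manually specifying the partition points" could look like. It assumes the subgroups are identified by a sortable key (the column name `group_id` is hypothetical, as is the helper `group_divisions`) and that the set of distinct keys is small enough to collect. The helper only computes the division values; the commented lines show how they might then be handed to Dask.

```python
def group_divisions(sorted_group_keys, groups_per_partition):
    """Compute Dask-style divisions so each partition holds only whole groups.

    Hypothetical helper: `sorted_group_keys` are the distinct, sorted group
    identifiers; the result has one entry per partition start plus a closing
    upper bound.
    """
    if groups_per_partition < 1:
        raise ValueError("groups_per_partition must be >= 1")
    keys = list(sorted_group_keys)
    if not keys:
        raise ValueError("at least one group key is required")

    # Every n-th key starts a new partition.
    divisions = keys[::groups_per_partition]

    # Close the final partition with the largest key (its bound is inclusive).
    # If the last start already is the largest key, that lone group is folded
    # into the previous partition instead of repeating the value.
    if divisions[-1] != keys[-1] or len(divisions) == 1:
        divisions.append(keys[-1])
    return divisions


divisions = group_divisions(range(10), 4)  # [0, 4, 8, 9]

# Assumed usage with Dask (not run here; `ddf` and "group_id" are placeholders):
#   keys = sorted(ddf["group_id"].unique().compute())
#   divisions = group_divisions(keys, 1000)
#   ddf = ddf.set_index("group_id", divisions=divisions)
# If the dataframe is already indexed by the group key with known divisions,
# ddf.repartition(divisions=divisions) could be used instead.
```

Because divisions are boundaries on index values, all rows sharing one key end up in the same partition. The last partition may hold a different number of groups than the others, and `set_index` triggers a shuffle, so this is likely to be expensive on a large dataset.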
