Best way to partition a dataframe respecting boundaries of row subgroups

Hello. I have a (large) dataset that is composed of multiple subgroups of rows (each subgroup can have a different number of rows). The dataset is read in parallel without respecting these subgroup boundaries. Is there a way to repartition this dataframe so that each dask dataframe partition contains only complete subgroups? For instance, so that each partition holds exactly 1000 complete subgroups?

The only way that I know to do that is to use the repartition function and to manually specify the partition points.

Thanks

@u_ser Hi, and welcome to Discourse! Would you be able to share a minimal, reproducible example? It’ll allow us to help you better!

> The only way that I know to do that is to use the repartition function and to manually specify the partition points.

From your description, this is the only way I can think of too. But, I’d be happy to test out a few things based on your minimal example. :smile: