Hello. I have a large dataset composed of multiple subgroups of rows (each subgroup can have a different number of rows). The dataset is read in parallel without respecting the subgroup boundaries. Is there a way to repartition this Dask dataframe so that each partition contains only complete subgroups? For instance, so that each partition holds 1000 complete subgroups?
The only way I know of to do this is to use the `repartition` function and specify the partition points (divisions) manually.
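To make the manual approach concrete, here is a rough sketch of how I would compute those partition points. It assumes every row carries a sortable subgroup key; the column name `group_id` and the helper `group_divisions` are made up for illustration and are not part of Dask. As far as I understand, Dask treats each division range as half-open except the last one, so a subgroup key should never be split across two partitions.

```python
def group_divisions(group_keys, groups_per_partition):
    """Return division points so that each partition covers whole subgroups.

    group_keys: iterable of subgroup keys (duplicates allowed).
    groups_per_partition: how many distinct subgroups per partition.
    """
    if groups_per_partition < 1:
        raise ValueError("groups_per_partition must be at least 1")
    keys = sorted(set(group_keys))
    if not keys:
        raise ValueError("no subgroup keys given")
    # Every n-th distinct key starts a new partition; the last key
    # closes the final partition (the final range is inclusive).
    return keys[::groups_per_partition] + [keys[-1]]


# Hypothetical use with a Dask dataframe `ddf` that has a `group_id` column:
#   keys = ddf["group_id"].unique().compute()
#   divisions = group_divisions(keys, 1000)
#   ddf = ddf.set_index("group_id", divisions=divisions)
# If `ddf` is already indexed by `group_id` with known divisions,
# ddf.repartition(divisions=divisions) should work instead.

print(group_divisions([3, 1, 2, 1, 0], 2))
```

This needs a pass over the data to collect the distinct keys, and `set_index` triggers a shuffle, which is why I am hoping there is something cheaper.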
Thanks.