Re-partitioning data frame and saving to parquet loses index and divisions

Hi @es-code-bar,

Yes, probably, but I’m wondering if this wouldn’t be a nice feature for Dask. If I understand correctly, hash-partitioned DataFrames should be able to optimize things for joins.

For the other part, reading from the documentation:

calculate_divisions bool, default False

Whether to use min/max statistics from the footer metadata (or global _metadata file) to calculate divisions for the output DataFrame collection. Divisions will not be calculated if statistics are missing. This option will be ignored if index is not specified and there is no physical index column specified in the custom “pandas” Parquet metadata. Note that calculate_divisions=True may be extremely slow when no global _metadata file is present, especially when reading from remote storage. Set this to True only when known divisions are needed for your workload (see Partitions).

Did you try

ddf = dd.read_parquet(
    dataset,
    calculate_divisions=True,
)

?