Hi @es-code-bar,
Yes, probably, but I'm wondering whether this wouldn't be a nice feature for Dask. If I understand correctly, hash-partitioned DataFrames should be able to optimize joins.
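To make the join point concrete, here is a minimal sketch (the frames and column names are made up for illustration). Dask's existing mechanism is range-based divisions rather than hash partitioning, but the effect on joins is similar: once both collections have known, sorted divisions, the join can be partition-aligned instead of requiring a full shuffle.

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(
    pd.DataFrame({"k": range(100), "x": 1}), npartitions=4
)
right = dd.from_pandas(
    pd.DataFrame({"k": range(100), "y": 2}), npartitions=4
)

# set_index sorts the data and records the partition boundaries,
# so both collections end up with known divisions
left = left.set_index("k")
right = right.set_index("k")

# With matching known divisions, the index join aligns partitions
# pairwise and avoids a full shuffle
joined = left.join(right)
```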
For the other part, reading from the documentation:
> **calculate_divisions** : bool, default False
> Whether to use min/max statistics from the footer metadata (or global `_metadata` file) to calculate divisions for the output DataFrame collection. Divisions will not be calculated if statistics are missing. This option will be ignored if `index` is not specified and there is no physical index column specified in the custom "pandas" Parquet metadata. Note that `calculate_divisions=True` may be extremely slow when no global `_metadata` file is present, especially when reading from remote storage. Set this to `True` only when known divisions are needed for your workload (see Partitions).
Did you try the following?

```python
import dask.dataframe as dd

ddf = dd.read_parquet(
    dataset,
    # per the docs above, you may also need index=... if there is
    # no pandas index column recorded in the Parquet metadata
    calculate_divisions=True,
)
```
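If the statistics are present, you can check that the divisions were actually picked up; `known_divisions` and `divisions` are standard attributes of a Dask DataFrame:

```python
print(ddf.known_divisions)  # True when partition boundaries are known
print(ddf.divisions)        # the boundary values, all None when unknown
```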