Hi @es-code-bar,
Yes, probably, but I'm wondering whether this wouldn't be a nice feature for Dask. If I understand correctly, hash-partitioned DataFrames should be able to optimize joins.
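To make the join point concrete, here is a minimal sketch (the frames and column names are made up for illustration). Dask's existing mechanism is range-based divisions rather than hash partitioning, but the effect on joins is similar: once both collections have known, sorted divisions, the join can be partition-aligned instead of requiring a full shuffle.

```python
import pandas as pd
import dask.dataframe as dd

left = dd.from_pandas(
    pd.DataFrame({"k": range(100), "x": 1}), npartitions=4
)
right = dd.from_pandas(
    pd.DataFrame({"k": range(100), "y": 2}), npartitions=4
)

# set_index sorts the data and records the partition boundaries,
# so both collections end up with known divisions
left = left.set_index("k")
right = right.set_index("k")

# With matching known divisions, the index join aligns partitions
# pairwise and avoids a full shuffle
joined = left.join(right)
```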
For the other part, reading from the documentation:
> **calculate_divisions** : bool, default False
> Whether to use min/max statistics from the footer metadata (or global `_metadata` file) to calculate divisions for the output DataFrame collection. Divisions will not be calculated if statistics are missing. This option will be ignored if `index` is not specified and there is no physical index column specified in the custom "pandas" Parquet metadata. Note that `calculate_divisions=True` may be extremely slow when no global `_metadata` file is present, especially when reading from remote storage. Set this to `True` only when known divisions are needed for your workload (see Partitions).
Did you try the following?

```python
import dask.dataframe as dd

ddf = dd.read_parquet(
    dataset,
    # per the docs above, you may also need index=... if there is
    # no pandas index column recorded in the Parquet metadata
    calculate_divisions=True,
)
```
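If the statistics are present, you can check that the divisions were actually picked up; `known_divisions` and `divisions` are standard attributes of a Dask DataFrame:

```python
print(ddf.known_divisions)  # True when partition boundaries are known
print(ddf.divisions)        # the boundary values, all None when unknown
```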