I have a couple of questions:
(a) I find that when writing a Dask dataframe as parquet files, the divisions are lost. How can I overcome this? I am using the default engine (i.e., pyarrow).
(b) If the dataframe already has an index but the divisions are absent, how do I assign divisions? I can do the following, but this code assumes that the index column is yet to be assigned:
```python
dask_divisions = ddf.set_index("id").divisions
unique_divisions = list(dict.fromkeys(list(dask_divisions)))
ddf = ddf.set_index("id", divisions=unique_divisions)
```
@VRM1 Welcome to Discourse!
> I find that when writing dask dataframe as parquet files, the divisions are lost.
Would you be able to share a minimal example? It’ll allow me to help you better.
A few notes:
- You can set `calculate_divisions=True` in `read_parquet` to get the divisions while reading your data back (this will work only if the global `_metadata` file exists).
- If the metadata file isn't written, you can set `write_metadata_file=True` in `to_parquet` to create it.
- The `read_parquet` and `to_parquet` API docs have some additional information and notes.
```python
import dask
import dask.dataframe as dd

# Create a sample dataframe and write it out along with the global _metadata file
ddf = dask.datasets.timeseries()
ddf.to_parquet("my_parquet_files", write_metadata_file=True)

# Reading with calculate_divisions=True restores the divisions
ddf_read = dd.read_parquet("my_parquet_files", calculate_divisions=True)
```
> If the dataframe already has index, but the divisions are absent, how to assign divisions?
I think you can do: