Divisions Lost When Writing to Parquet

I have a couple of questions:

(a) I find that when writing a Dask DataFrame to Parquet files, the divisions are lost. How can I overcome this? I am using the default engine (i.e., pyarrow).

(b) If the DataFrame already has an index but the divisions are absent, how can I assign divisions? I can do the following, but this code assumes that the index column has not yet been set.

dask_divisions = ddf.set_index("id").divisions
unique_divisions = list(dict.fromkeys(list(dask_divisions)))
ddf = ddf.set_index("id", divisions=unique_divisions)

@VRM1 Welcome to Discourse!

I find that when writing a Dask DataFrame to Parquet files, the divisions are lost.

Would you be able to share a minimal example? It’ll allow me to help you better. :smile:

A few notes: divisions are not computed by default when reading Parquet. If you write a _metadata file and pass calculate_divisions=True to read_parquet, the divisions come back:

import dask
import dask.dataframe as dd

# Example DataFrame with a timestamp index and known divisions
ddf = dask.datasets.timeseries()

# Writing a global _metadata file stores the row-group statistics in one place
ddf.to_parquet("my_parquet_files", write_metadata_file=True)

# calculate_divisions=True tells read_parquet to use those statistics
# to set the index divisions on the returned DataFrame
ddf_read = dd.read_parquet("my_parquet_files", calculate_divisions=True)
print(ddf_read.divisions)
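
With the divisions recovered, ddf_read.divisions prints a tuple of timestamps; without calculate_divisions=True it would be a tuple of None values.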

If the DataFrame already has an index but the divisions are absent, how can I assign divisions?

I think you can do: dask_dataframe.reset_index().set_index("index_column_name")
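
Here is a minimal sketch of that approach, assuming the timeseries dataset and its "timestamp" index; clear_divisions() is only used to simulate a DataFrame whose index is set but whose divisions are unknown:

import dask
import dask.dataframe as dd

ddf = dask.datasets.timeseries()

# Simulate a DataFrame with an index but no known divisions
ddf_no_divs = ddf.clear_divisions()
print(ddf_no_divs.known_divisions)  # False

# Re-setting the index recomputes divisions (this triggers a shuffle/sort)
ddf_with_divs = ddf_no_divs.reset_index().set_index("timestamp")
print(ddf_with_divs.known_divisions)  # True
print(ddf_with_divs.divisions)

Note that set_index computes divisions by default, at the cost of a shuffle of the data.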