I have a couple of questions:
(a) I find that when writing a Dask dataframe as parquet files, the divisions are lost. How can I overcome this? I am using the default engine (i.e., pyarrow).
(b) If the dataframe already has an index but the divisions are absent, how do I assign divisions? I can do the following, but this code assumes that the index column is yet to be assigned:
```python
dask_divisions = ddf.set_index("id").divisions
unique_divisions = list(dict.fromkeys(list(dask_divisions)))
ddf = ddf.set_index("id", divisions=unique_divisions)
```
@VRM1 Welcome to Discourse!
> I find that when writing dask dataframe as parquet files, the divisions are lost.
Would you be able to share a minimal example? It’ll allow me to help you better.
A few notes:
- You can set `calculate_divisions=True` in `read_parquet` to get the divisions while reading your data back (this will work only if the global `_metadata` file exists).
- If the metadata file isn't written, you can set `write_metadata_file=True` in `to_parquet` to create it.
- The `read_parquet` and `to_parquet` API docs have some additional information and notes.
```python
import dask
import dask.dataframe as dd

# Create a sample dataframe and write it out along with the global _metadata file
ddf = dask.datasets.timeseries()
ddf.to_parquet("my_parquet_files", write_metadata_file=True)

# Reading with calculate_divisions=True restores the divisions
ddf_read = dd.read_parquet("my_parquet_files", calculate_divisions=True)
```
> If the dataframe already has index, but the divisions are absent, how to assign divisions?
I think you can do: