Maintaining index between .values and .to_dask_dataframe


I’m still fairly new to Dask and getting my head around how partitioning and distributed indexes work. I’m wondering whether switching between Dask DataFrames and Dask Arrays without altering partitions guarantees that indexes are preserved. Consider the following:

import numpy as np
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({'r': np.random.randn(10)}), chunksize=5)

ddf2 = ddf.values.to_dask_dataframe(index=ddf.index)

ddf3 = ddf.map_partitions(lambda x: x + 1).values.to_dask_dataframe(index=ddf.index).compute()

Here I start with a Dask DataFrame, take its values as a Dask Array, and then convert that back into a DataFrame. My question is: is ddf2 guaranteed to equal ddf exactly, and how does that work? Since I have 2 chunks of equal size, is there a chance that the second chunk could end up assigned the first chunk of the index and vice versa?

(For context, the conversion to an array is out of my control, and DataFrames fit my later processing pipeline better.)


Hi @stig, welcome to Dask community!

If you don’t alter the number or order of the partitions, I see no reason you’d run into problems. But I have to admit this is not a strong answer.

The task graph for the creation of ddf2 is pretty clear though:

Input partitions and chunks are preserved.

Thanks Guillaume!

That makes sense, I hadn’t thought of checking the Dask graph! Is there any way of checking this programmatically, so I could add it as a validation step in my pipeline?

There is probably a way to check, or at least to confirm in the code that this is the case, but unfortunately I really don’t know how.