Hi,
I’m still fairly new to Dask and getting my head around how partitioning and distributed indexes work. I’m wondering whether converting between a Dask DataFrame and a Dask Array without altering the partitions guarantees that the index is preserved. Consider the following:
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Two partitions of five rows each.
ddf = dd.from_pandas(pd.DataFrame({'r': np.random.randn(10)}), chunksize=5)

# DataFrame -> Array -> DataFrame, reattaching the original index.
ddf2 = ddf.values.to_dask_dataframe(index=ddf.index)

# The same round trip with a map_partitions step in between, materialized with .compute().
ddf3 = ddf.map_partitions(lambda x: x + 1).values.to_dask_dataframe(index=ddf.index).compute()
Here I start with a Dask DataFrame, take its values as a Dask Array, and then convert that back into a DataFrame. My question is: is there a guarantee that ddf2 will equal ddf exactly, and how does this work? Since I have two chunks of equal size, is there a chance that the second chunk could end up being assigned the first chunk of the index, and vice versa?
(For context, the conversion to an array is out of my control, and DataFrames fit my later processing pipeline better.)
Thanks!