Hi,
I’m still fairly new to Dask and getting my head around how partitioning and distributed indexes work. I’m wondering whether converting between a Dask DataFrame and a Dask Array without altering the partitions guarantees that the index is preserved. Consider the following:
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Two partitions of five rows each.
ddf = dd.from_pandas(pd.DataFrame({'r': np.random.randn(10)}), chunksize=5)

# DataFrame -> Array -> DataFrame, reattaching the original index.
ddf2 = ddf.values.to_dask_dataframe(index=ddf.index)

# The same round trip with a map_partitions step in between, materialized with .compute().
ddf3 = ddf.map_partitions(lambda x: x + 1).values.to_dask_dataframe(index=ddf.index).compute()
Here I start with a Dask DataFrame, take its values as a Dask Array, and then convert that back into a DataFrame. My question is: is there a guarantee that ddf2 will equal ddf exactly, and how does this work? Since I have two chunks of equal size, is there a chance that the second chunk could end up being assigned the first chunk of the index, and vice versa?
(For context, the conversion to an array is out of my control, and DataFrames fit my later processing pipeline better.)
Thanks!