Index does not exist on the expected division

qherm · April 12, 2024, 10:33pm

I have a dask dataframe whose divisions are (1, 5923, 11845). I want to fetch the row with index 8851. When I call df.get_partition(1).compute().loc[8851], I get an error saying the key doesn’t exist. When I call df.get_partition(0).compute().loc[8851], the item is there. Considering 8851 is a number between 5923 and 11845, I would expect it to exist in the second partition.

Can someone please explain why it isn’t? Are divisions and partitions separate concepts? I’m using dask 2023.5.0.

guillaumeeb · April 17, 2024, 4:34pm

Hi @qherm, welcome to the Dask community!

I believe divisions and partitions should always be aligned; otherwise divisions wouldn’t be very useful.

I just tried to reproduce your issue to no avail:

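import pandas as pd
import dask.dataframe as dd

# 3000 rows with an integer index running from 0 to 2999
df = pd.DataFrame(dict(a=list('a'*1000+'b'*2000), b=list(range(3000))),
                  index=list(range(3000)))

# Split into 3 partitions along the sorted index
ddf = dd.from_pandas(df, npartitions=3)

# Index 2500 falls in the last partition, and this lookup succeeds
ddf.get_partition(2).compute().loc[2500]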

Could you come up with a reproducer?

What does the result look like when you call df.get_partition(1).compute()? Which index values does it contain?
How did you build this dataframe?
