Should `set_index` change the row count of a DataFrame?

On Dataproc, I have some code that does processing like this:

import dask.dataframe as dd

df = dd.read_parquet("gs://input-path", gather_statistics=False)
df["index"] = 1
df["index"] = df["index"].cumsum()  # keep track of original row order
# ... do some more processing ...
df = df.reset_index("index")  # want to restore original row order
df.to_parquet("gs://output-path", overwrite=True)

The second-to-last line, where I reset the index, reduces the row count of df. If I comment that line out, the row count of df does not change. Is this expected?

Hello @hahdawg and welcome! I wouldn’t expect dask.dataframe.DataFrame.reset_index to reduce the number of rows in your Dask DataFrame. Would you be able to share a minimal reproducible example? The main difference between the Pandas and Dask implementations is that in Dask the new index restarts at 0 for each partition, because the length of the entire dataframe cannot be known before computing (the Dask documentation for reset_index has more details).
