On Dataproc, I have some Dask code that does processing like this:
import dask.dataframe as dd
df = dd.read_parquet("gs://input-path", gather_statistics=False)
df["index"] = 1
df["index"] = df["index"].cumsum() # keep track of original row order
# ... do some more processing ...
df = df.reset_index("index") # want to restore original row order
df.to_parquet("gs://output-path", overwrite=True)
The second-to-last line, where I reset the index, reduces the row count of df. If I comment that line out, the row count of df does not change. Should this be happening?
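
To make the intent concrete, here is a minimal sketch of the tag-then-restore pattern on a toy frame. It uses plain pandas rather than Dask on GCS so that it runs anywhere, and the helper names `tag_row_order` and `restore_row_order` are hypothetical, not part of the code above. It assumes the goal is to sort rows back by the tracked column, which is done here with `set_index` plus `sort_index`, since `reset_index` moves the index into a column and does not sort. Comparing the row count before and after is a quick way to check whether the restore step drops rows.

```python
import pandas as pd


def tag_row_order(df: pd.DataFrame, col: str = "index") -> pd.DataFrame:
    """Add a 1-based running counter that records the original row order."""
    out = df.copy()
    out[col] = 1
    out[col] = out[col].cumsum()
    return out


def restore_row_order(df: pd.DataFrame, col: str = "index") -> pd.DataFrame:
    """Make the counter the index and sort by it to recover the original order."""
    return df.set_index(col).sort_index()


# Toy data standing in for the Parquet input (assumption: any frame will do).
raw = pd.DataFrame({"value": ["a", "b", "c", "d"]})

tagged = tag_row_order(raw)

# Stand-in for "some more processing" that reorders rows.
shuffled = tagged.sample(frac=1, random_state=0)

restored = restore_row_order(shuffled)

# The restore step should not change the number of rows.
rows_before = len(tagged)
rows_after = len(restored)
print(rows_before, rows_after)
```

If the same comparison on the real Dask frame shows different counts, the row loss is coming from the restore step itself rather than from the earlier processing.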