DataFrame.to_parquet converts RangeIndex to Int64Index

I tried writing a Dask DataFrame with a RangeIndex to a Parquet file and then reading it back in. I noticed the RangeIndex wasn’t preserved and was instead converted to an Int64Index. Is this expected?

Here’s an example:

Create a pandas DataFrame with a RangeIndex:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> dfp = pd.DataFrame(data = {'col0': [2, 4, 6, 8]}, index = range(4))
>>> dfp.index
RangeIndex(start=0, stop=4, step=1)
>>> dfd = dd.from_pandas(dfp, npartitions = 2)

Sanity check that partitions keep the RangeIndex:

>>> _ = dfd.map_partitions(lambda df: print(df.index)).compute()
RangeIndex(start=0, stop=2, step=1)
RangeIndex(start=0, stop=2, step=1)
RangeIndex(start=2, stop=4, step=1)

Write the DataFrame out to Parquet, read it back in, and see that the RangeIndex is now an Int64Index:

>>> dfd.to_parquet('/tmp/xx')
>>> dd.read_parquet('/tmp/xx').index.compute()
Int64Index([0, 1, 2, 3], dtype='int64')

If I pass calculate_divisions=True to read_parquet, Dask seems to compute the partition divisions quickly from the Parquet metadata, so maybe I don’t need a RangeIndex for fast selection by index? For example, I’d like to avoid reading the entire first half of a Parquet file just to get a chunk of data points starting in the middle.

>>> dd.read_parquet('/tmp/xx', calculate_divisions = True).divisions
(0, 2, 3)

Side note: it seems Dask can’t calculate divisions for a Parquet file written by pandas with a RangeIndex. Is this also expected?

>>> dfp.to_parquet('/tmp/xx.pq')
>>> dd.read_parquet('/tmp/xx.pq', calculate_divisions = True).divisions
(None, None)

Hi @twrightsman, welcome to this forum!

I’m able to reproduce your issue, thanks for the code.

The important point is that when you write the Parquet file with pandas and read it back with Dask, the RangeIndex is kept:


returns: RangeIndex(start=0, stop=4, step=1)

I’m not sure if this is a bug or a constraint of partitioned datasets written to Parquet. Maybe @rjzamora has some thoughts?

I was able to find this old Stack Overflow question, but I’m not sure whether it is still relevant.

I would say this is normal: if you create the Parquet file with pandas, there won’t be any divisions, will there?

Hi @guillaumeeb, thank you!

Good point about Dask recognizing a RangeIndex in a pandas-written Parquet file. It inspired me to check whether this also works when multiple pandas-written Parquet files are read into a single Dask DataFrame:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> dfp = pd.DataFrame(data = {'col0': [2, 4, 6, 8]}, index = range(4))
>>> dfp2 = pd.DataFrame(data = {'col0': [10, 12, 14, 16]}, index = range(4, 8))
>>> dfp.to_parquet('/tmp/xx.1.pq')
>>> dfp2.to_parquet('/tmp/xx.2.pq')
>>> dfd2 = dd.read_parquet(['/tmp/xx.1.pq', '/tmp/xx.2.pq'], calculate_divisions = True)
>>> dfd2.index.compute()
RangeIndex(start=0, stop=8, step=1)
>>> dfd2.divisions
(None, None, None)

So yes, Dask correctly combines a RangeIndex across multiple Parquet files written by pandas.

However, it still doesn’t seem to be able to calculate the divisions. I would assume it could, given that the RangeIndex of each Parquet file should contain its start and stop.