DataFrame.to_parquet converts RangeIndex to Int64Index

twrightsman · March 2, 2023, 4:24pm

I tried writing a Dask DataFrame with a RangeIndex to a Parquet file and then reading it back in. I noticed the RangeIndex wasn’t preserved and was instead converted to an Int64Index. Is this expected?

Here’s an example:

Create a pandas DataFrame with a RangeIndex:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> dfp = pd.DataFrame(data = {'col0': [2, 4, 6, 8]}, index = range(4))
>>> dfp.index
RangeIndex(start=0, stop=4, step=1)
>>> dfd = dd.from_pandas(dfp, npartitions = 2)

Sanity check that partitions keep the RangeIndex:

>>> _ = dfd.map_partitions(lambda df: print(df.index)).compute()
RangeIndex(start=0, stop=2, step=1)
RangeIndex(start=0, stop=2, step=1)
RangeIndex(start=2, stop=4, step=1)

Write the DataFrame out to a Parquet file, read it back in, and see that the RangeIndex is now an Int64Index:

>>> dfd.to_parquet('/tmp/xx')
>>> dd.read_parquet('/tmp/xx').index.compute()
Int64Index([0, 1, 2, 3], dtype='int64')

I can use calculate_divisions in read_parquet and Dask seems to quickly compute the partition divisions using the Parquet metadata, so maybe I should not worry about having a RangeIndex for quick selection by index? I’d like to avoid reading the entire first half of a given Parquet file to get a chunk of data points starting in the middle, for example.

>>> dd.read_parquet('/tmp/xx', calculate_divisions = True).divisions
(0, 2, 3)
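To spell out why divisions like these should be enough for quick selection by index, here is a minimal pure-Python sketch of the lookup they allow. It is not Dask's actual implementation; `partition_for` is a hypothetical helper, and it only assumes the usual meaning of divisions (each entry is the lower bound of a partition, and the last entry is the inclusive upper bound of the final partition).

```python
from bisect import bisect_right


def partition_for(divisions, value):
    """Return the position of the partition whose index range contains value.

    Hypothetical helper for illustration only, not part of the Dask API.
    """
    npartitions = len(divisions) - 1
    if not divisions[0] <= value <= divisions[-1]:
        raise KeyError(value)
    # The last division is an inclusive upper bound, so a value equal to it
    # belongs to the final partition rather than to a partition past the end.
    return min(bisect_right(divisions, value) - 1, npartitions - 1)


divisions = (0, 2, 3)  # the divisions reported by read_parquet above
lookup = {value: partition_for(divisions, value) for value in range(4)}
print(lookup)
```

With divisions known, only the partition returned by the lookup has to be read, which is the behavior I am hoping for when selecting a chunk from the middle of the data.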

Side note: It seems Dask can’t calculate divisions for a Parquet file created by Pandas with a RangeIndex. Is this also expected?

>>> dfp.to_parquet('/tmp/xx.pq')
>>> dd.read_parquet('/tmp/xx.pq', calculate_divisions = True).divisions
(None, None)

guillaumeeb · March 3, 2023, 9:41am

Hi @twrightsman, welcome to this forum!

I’m able to reproduce your issue, thanks for the code.

The important point is that when you write the parquet file with Pandas and read it back with Dask, the RangeIndex is kept:

dfp.to_parquet('/tmp/yy')
dd.read_parquet('/tmp/yy').index.compute()

returns: RangeIndex(start=0, stop=4, step=1)

I’m not sure if this is a bug or a constraint of partitioned datasets written to Parquet. Maybe @rjzamora has some thoughts?

I was able to find this old Stack Overflow question, but I’m not sure if it is still relevant.

As for the side note, I would say this is normal: if you create the Parquet file with Pandas, there won’t be any divisions, will there?

twrightsman · March 3, 2023, 2:55pm

Hi @guillaumeeb , thank you!

Good point on Dask recognizing a RangeIndex in a Pandas-written Parquet file. It inspired me to check whether this also works when multiple Pandas-written Parquet files are read into a single Dask DataFrame:

>>> import dask.dataframe as dd
>>> import pandas as pd
>>> dfp = pd.DataFrame(data = {'col0': [2, 4, 6, 8]}, index = range(4))
>>> dfp2 = pd.DataFrame(data = {'col0': [10, 12, 14, 16]}, index = range(4, 8))
>>> dfp.to_parquet('/tmp/xx.1.pq')
>>> dfp2.to_parquet('/tmp/xx.2.pq')
>>> dfd2 = dd.read_parquet(['/tmp/xx.1.pq', '/tmp/xx.2.pq'], calculate_divisions = True)
>>> dfd2.index.compute()
RangeIndex(start=0, stop=8, step=1)
>>> dfd2.divisions
(None, None, None)

So yes, Dask correctly combines a RangeIndex across multiple Parquet files when they are written by Pandas.

However, it still doesn’t seem to be able to calculate the divisions correctly. I would assume it should, given the RangeIndex should contain the start and stop of each Parquet file.
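To make that assumption concrete, here is a rough pure-Python sketch of the calculation I have in mind. It is not how Dask computes divisions; `divisions_from_ranges` is a hypothetical helper, and plain `range` objects stand in for the RangeIndex of each file.

```python
def divisions_from_ranges(ranges):
    """Derive Dask-style divisions from one range per Parquet file.

    Hypothetical helper for illustration only, not part of the Dask API.
    The ranges must be non-empty, increasing, sorted, and non-overlapping.
    """
    if not ranges or any(len(r) == 0 or r.step <= 0 for r in ranges):
        raise ValueError("need at least one non-empty, increasing range")
    for prev, nxt in zip(ranges, ranges[1:]):
        if prev[-1] >= nxt[0]:
            raise ValueError("ranges must be sorted and must not overlap")
    # Each file contributes its first index value as a lower bound; the last
    # file also contributes its final index value as the inclusive upper bound.
    return tuple(r[0] for r in ranges) + (ranges[-1][-1],)


# Stand-ins for the indexes of /tmp/xx.1.pq and /tmp/xx.2.pq
file_ranges = [range(0, 4), range(4, 8)]
divisions = divisions_from_ranges(file_ranges)
print(divisions)
```

For the two files above this gives (0, 4, 7), which is what I would have expected from calculate_divisions instead of (None, None, None).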
