I tried writing a Dask DataFrame with a RangeIndex to a Parquet file and then reading it back in. I noticed the RangeIndex wasn't preserved and was instead converted to an Int64Index. Is this expected?
Here’s an example:
Create a pandas DataFrame with a RangeIndex:
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> dfp = pd.DataFrame(data={'col0': [2, 4, 6, 8]}, index=range(4))
>>> dfp.index
RangeIndex(start=0, stop=4, step=1)
>>> dfd = dd.from_pandas(dfp, npartitions=2)
Sanity check that partitions keep the RangeIndex:
>>> _ = dfd.map_partitions(lambda df: print(df.index)).compute()
RangeIndex(start=0, stop=2, step=1)
RangeIndex(start=0, stop=2, step=1)
RangeIndex(start=2, stop=4, step=1)
Write the DataFrame out to a Parquet file, read it back in, and observe that the RangeIndex is now an Int64Index:
>>> dfd.to_parquet('/tmp/xx')
>>> dd.read_parquet('/tmp/xx').index.compute()
Int64Index([0, 1, 2, 3], dtype='int64')
I can use calculate_divisions in read_parquet, and Dask seems to compute the partition divisions quickly from the Parquet metadata, so maybe I shouldn't worry about losing the RangeIndex for quick selection by index? For example, I'd like to avoid reading the entire first half of a Parquet file just to get a chunk of data points that starts in the middle.
>>> dd.read_parquet('/tmp/xx', calculate_divisions=True).divisions
(0, 2, 3)
Side note: it seems Dask can't calculate divisions for a Parquet file created by pandas with a RangeIndex. Is this also expected?
>>> dfp.to_parquet('/tmp/xx.pq')
>>> dd.read_parquet('/tmp/xx.pq', calculate_divisions=True).divisions
(None, None)