I’m a bit confused about how to work with the index of a DataFrame. This seems basic, but I can’t find the correct solution:
If I convert a column to an index and drop it (the default), the column won’t be available when I try to access it by its name. If I save the DataFrame to Parquet and import it into BigQuery, the whole column is non-existent. So I have chosen not to drop the column. But if I don’t drop the column, I can’t merge on that column anymore: it complains about the duplicated name (which is really inconsistent, as the index is not accessible by name in other parts of the DataFrame API). If I merge by the index, I get a duplicated column. And I get an error when I try to rename the index; Dask won’t let me.
What is the correct way to have all the above working? Do I first need to rename the column, then set the index, then rename the column back? Or is there a better solution?
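Roughly, the pattern I keep running into looks like this (simplified; column names are made up and the exact error messages may differ):

import dask.dataframe as dd

ddf = dd.from_dict({"hash": ["a", "b", "c"], "value": [1, 2, 3]}, npartitions=1)

# With the default drop=True the column only lives in the index afterwards
ddf_indexed = ddf.set_index("hash")
# ddf_indexed["hash"]  # KeyError: the column is no longer accessible by name

# Keeping the column (drop=False) makes merging on "hash" complain instead
ddf_kept = ddf.set_index("hash", drop=False)
other = dd.from_dict({"hash": ["a", "b"], "extra": [10, 20]}, npartitions=1)
# dd.merge(ddf_kept, other, on="hash")  # error about "hash" being ambiguous/duplicated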
@paultjuh Welcome to Discourse!
What is the correct way to have all the above working? Do I first need to rename the column, then set the index, then rename the column back? Or is there a better solution?
Could you please share a minimal example of your workflow? It’ll allow us to help you better!
If I merge by the index, I get a duplicated column.
In general, I’d suggest exactly this: merge on the index. You can then drop the duplicated column afterwards if you want (see the note after the example below).
And I get an error when I try to rename the index, dask won’t let me.
I think you can use ddf.index.rename:
import dask.dataframe as dd

# First frame: "y" becomes the index but is kept as a column too (drop=False)
ddf1 = dd.DataFrame.from_dict({
    'x': range(10, 20),
    'y': range(20, 30),
}, npartitions=2)
ddf1 = ddf1.set_index("y", drop=False)

# Second frame, sharing the "y" column
ddf2 = dd.DataFrame.from_dict({
    'y': range(20, 30),
    'z': range(30, 40),
}, npartitions=2)

# Merge on the index of the left frame and the "y" column of the right frame
ddf_merged = dd.merge(ddf1, ddf2, left_index=True, right_on="y")

# Rename the index of the merged frame
ddf_merged.index = ddf_merged.index.rename("index_renamed")
ddf_merged.compute()
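If the duplicated column from the merge bothers you, you should be able to drop it afterwards. The exact name depends on the suffixes pandas applies to the overlapping columns (something like y_x/y_y here), so check ddf_merged.columns first:

# Drop the duplicated key column coming from the right frame
# (adjust the name to whatever suffix actually shows up in ddf_merged.columns)
ddf_merged = ddf_merged.drop(columns="y_y")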
Does this help answer your question?
Thanks. I worked out how to rename the index, like in your example. I still don’t know whether I’m using Dask correctly, though. My workflow:
1. Load data from a third-party source
2. Merge with the existing data to remove duplicates
3. Save to Parquet
4. Load the Parquet into BigQuery
I’m not sure step 2 of the process is really efficient. The steps for this part of the workflow (sketched in code below) are:
- load the existing data from Parquet (just the column I need to identify existing records: hash)
- merge it with the data loaded from the third-party source, on the hash column; while hash is not the index, both DataFrames have the same index (time), and a record with the same hash has the same index (should I merge on both the time column and the hash column to make sure this runs efficiently?)
- output both a DataFrame with the duplicates and one with the duplicates removed
- merge the DataFrame with the duplicates against the other DataFrames whose records are all linked to the hash column, to remove all the other duplicated data as well
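To make that concrete, step 2 currently looks roughly like this (paths and column names are simplified; here I’m using merge’s indicator column to split duplicates from new records):

import dask.dataframe as dd

# Existing records: load only the "hash" column from Parquet to identify duplicates
existing = dd.read_parquet("existing/", columns=["hash"])

# "new" stands in for the DataFrame built from the third-party source
# (indexed by "time"); for this sketch pretend it is another Parquet dataset
new = dd.read_parquet("incoming/")

# Left merge with an indicator column, so every incoming record is marked as
# either already known ("both") or genuinely new ("left_only")
marked = dd.merge(new, existing, on="hash", how="left", indicator=True)

duplicates = marked[marked["_merge"] == "both"].drop(columns="_merge")
deduplicated = marked[marked["_merge"] == "left_only"].drop(columns="_merge")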
I was also wondering why it is so hard to work with the index:
- how to rename it is hard to find, and the documented rename method on the DataFrame even takes an index parameter (which gives an error if you use it)
- the data is still there and has a column name, but you can’t access it by that name (except when doing a merge, which is strange), which complicates the code because you can’t use the name but have to use “index” instead
- if I drop the column that is also in the index (technically duplicated data), I can’t access the data in other places; for example, loading it into BigQuery won’t load the index (a workaround sketch follows below)
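For that last point, one workaround seems to be to put the data back into a regular column just before writing, something like this (names are made up):

import dask.dataframe as dd

# A frame indexed by "hash" with the column dropped (the default behaviour)
ddf = dd.from_dict({"hash": ["a", "b", "c"], "value": [1, 2, 3]}, npartitions=1)
ddf = ddf.set_index("hash")

# Turn the index back into an ordinary column right before writing, so the
# values end up as a plain Parquet column that BigQuery should pick up
ddf.reset_index().to_parquet("out/", write_index=False)

But that feels like working around the index rather than with it.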
It doesn’t really answer my question; see my reply below.