I’m a bit confused about how to work with the index of a DataFrame. This seems basic, but I can’t find the correct solution:
If I convert a column to an index and drop it (the default), the column won’t be available when I try to access it by its name. If I save the DataFrame to Parquet and import it into BigQuery, the whole column is non-existent. So I have chosen not to drop the column. But if I don’t drop the column, I can’t merge on that column anymore: it complains about the duplicated name (which is really inconsistent, as the index is not accessible by name in other parts of the DataFrame API). If I merge by the index, I get a duplicated column. And I get an error when I try to rename the index; Dask won’t let me.
What is the correct way to have all the above working? Do I first need to rename the column, then set the index, then rename the column back? Or is there a better solution?
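Roughly, the pattern I keep running into looks like this (simplified; column names are made up and the exact error messages may differ):

import dask.dataframe as dd

ddf = dd.from_dict({"hash": ["a", "b", "c"], "value": [1, 2, 3]}, npartitions=1)

# With the default drop=True the column only lives in the index afterwards
ddf_indexed = ddf.set_index("hash")
# ddf_indexed["hash"]  # KeyError: the column is no longer accessible by name

# Keeping the column (drop=False) makes merging on "hash" complain instead
ddf_kept = ddf.set_index("hash", drop=False)
other = dd.from_dict({"hash": ["a", "b"], "extra": [10, 20]}, npartitions=1)
# dd.merge(ddf_kept, other, on="hash")  # error about "hash" being ambiguous/duplicated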
@paultjuh Welcome to Discourse!
What is the correct way to have all the above working? Do I first need to rename the column, then set the index, then rename the column back? Or is there a better solution?
Could you please share a minimal example of your workflow? It’ll allow us to help you better!
If I merge by the index, I get a duplicated column.
In general, I’d suggest exactly this: merge on the index. You can then drop the duplicated column afterwards if you want (see the note after the example below).
And I get an error when I try to rename the index, dask won’t let me.
I think you can use ddf.index.rename:
import dask.dataframe as dd

# First frame: "y" becomes the index but is kept as a column too (drop=False)
ddf1 = dd.DataFrame.from_dict({
    'x': range(10, 20),
    'y': range(20, 30),
}, npartitions=2)
ddf1 = ddf1.set_index("y", drop=False)

# Second frame, sharing the "y" column
ddf2 = dd.DataFrame.from_dict({
    'y': range(20, 30),
    'z': range(30, 40),
}, npartitions=2)

# Merge on the index of the left frame and the "y" column of the right frame
ddf_merged = dd.merge(ddf1, ddf2, left_index=True, right_on="y")

# Rename the index of the merged frame
ddf_merged.index = ddf_merged.index.rename("index_renamed")
ddf_merged.compute()
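If the duplicated column from the merge bothers you, you should be able to drop it afterwards. The exact name depends on the suffixes pandas applies to the overlapping columns (something like y_x/y_y here), so check ddf_merged.columns first:

# Drop the duplicated key column coming from the right frame
# (adjust the name to whatever suffix actually shows up in ddf_merged.columns)
ddf_merged = ddf_merged.drop(columns="y_y")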
Does this help answer your question?
Thanks. I worked out how to rename the index, like in your example. I still don’t know whether I’m using Dask correctly, though. My workflow:
1. Load data from a third-party source
2. Merge with the existing data to remove duplicates
3. Save to Parquet
4. Load the Parquet into BigQuery
I’m not sure step 2 of the process is really efficient. The steps for this part of the workflow (sketched in code below) are:
- load the existing data from Parquet (just the column I need to identify existing records: hash)
- merge it with the data loaded from the third-party source, on the hash column; while hash is not the index, both DataFrames have the same index (time), and a record with the same hash has the same index (should I merge on both the time column and the hash column to make sure this runs efficiently?)
- output both a DataFrame with the duplicates and one with the duplicates removed
- merge the DataFrame with the duplicates against the other DataFrames whose records are all linked to the hash column, to remove all the other duplicated data as well
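To make that concrete, step 2 currently looks roughly like this (paths and column names are simplified; here I’m using merge’s indicator column to split duplicates from new records):

import dask.dataframe as dd

# Existing records: load only the "hash" column from Parquet to identify duplicates
existing = dd.read_parquet("existing/", columns=["hash"])

# "new" stands in for the DataFrame built from the third-party source
# (indexed by "time"); for this sketch pretend it is another Parquet dataset
new = dd.read_parquet("incoming/")

# Left merge with an indicator column, so every incoming record is marked as
# either already known ("both") or genuinely new ("left_only")
marked = dd.merge(new, existing, on="hash", how="left", indicator=True)

duplicates = marked[marked["_merge"] == "both"].drop(columns="_merge")
deduplicated = marked[marked["_merge"] == "left_only"].drop(columns="_merge")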
I was also wondering why it is so hard to work with the index:
- how to rename it is hard to find, and the documented rename method on the DataFrame even takes an index parameter (which gives an error if you use it)
- the data is still there and has a column name, but you can’t access it by that name (except when doing a merge, which is strange), which complicates the code because you can’t use the name but have to use “index” instead
- if I drop the column that is also in the index (technically duplicated data), I can’t access the data in other places; for example, loading it into BigQuery won’t load the index (a workaround sketch follows below)
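For that last point, one workaround seems to be to put the data back into a regular column just before writing, something like this (names are made up):

import dask.dataframe as dd

# A frame indexed by "hash" with the column dropped (the default behaviour)
ddf = dd.from_dict({"hash": ["a", "b", "c"], "value": [1, 2, 3]}, npartitions=1)
ddf = ddf.set_index("hash")

# Turn the index back into an ordinary column right before writing, so the
# values end up as a plain Parquet column that BigQuery should pick up
ddf.reset_index().to_parquet("out/", write_index=False)

But that feels like working around the index rather than with it.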
It doesn’t really answer my question; see my reply below.