Inconsistencies with Dask Columns & Indices

L_Martin · January 23, 2025, 2:14pm

Hello there, I’m quite new with Dask, so I may be asking a question which is pretty trivial or I may simply be doing things horribly wrong.
So, I am working with a Dask DataFrame which has, as columns, ORBIT, chamber, plus a handful more.
Now, what I need to do is to first groupby the Dask DataFrame by the columns ORBIT and chamber, then apply a function to the groups. At this point, I have to take the resulting Dask DataFrame, group it by ORBIT once again, and apply a second funciton. In short, this is what I am doing:

results_intermediate = ddf.groupby(['ORBIT', 'chamber'])\
                          .apply(func1, meta = meta 1)\
                          .reset_index()

results_final = results_intermediate.groupby(['ORBIT'])\
                                    .apply(func2, axis = 1, meta = meta2)

And I get the following error:

KeyError: 'ORBIT'

suggesting that ORBIT is, indeed, not a column of results_intermediate.
What happens, apparently, is that when I do reset_index(), the columns ORBIT, chamber and a new one that is defined while performing the whole operation (for me it’s named level_2) are all collapsed into a single column named index. This odd behavior is completely removed when working with the Pandas DataFrame: whenever I try to use .head() or .compute() the columns contained into index are now separate into ORBIT, chamber, and level_2.
Since I have to access the column ORBIT from the Dask DataFrame (for the computation of results_final) I was wondering how I can solve this issue.

Thank you very much in advance!

guillaumeeb · January 24, 2025, 10:27pm

Hi @L_Martin, welcome to Dask community!

Okay, it took me some times, but I think I figured the issue: problems probably comes from meta argument construction. I built a small example where it seems meta was correctly inferred, but whenever I tried to set it to give a real name to the output column, everything failed.

This github issue gave me the trick.

import dask
import pandas as pd
df = dask.datasets.timeseries()

def func1(partition):
    return partition.x.sum()

def func2(partition):
    return partition.Sum.mean()

meta = pd.Series(name='Sum', dtype='f8', index=pd.MultiIndex([[], []], [[], []], names=['name', 'id']))
results_intermediate = df.groupby(['name', 'id'])\
                         .apply(func1, meta=meta)\
                         .reset_index()

results_final = results_intermediate.reset_index().groupby(['name'])\
                                    .apply(func2)

results_final.compute()

I’m not really certain meta should be that complicated to use here… Maybe you should open an issue with this reproducer.

L_Martin · January 25, 2025, 1:19pm

Thank you very much @guillaumeeb! Yeah, the behavior of meta seems to be very odd in this case, I tried some other combinations and it still didn’t work, but this time the error that was returned was way more addressed towards the meta argument indeed. I will make sure to add more details about these different errors I got.
Anyway, thank you again!

guillaumeeb · January 29, 2025, 12:33pm

Hi @L_Martin, did you end up opening an issue or not yet? Do you intend to or should I do it? I think this at least needs some clarification in the documentation.

L_Martin · January 29, 2025, 1:47pm

To be fair, I haven’t opened it yet. I am quite new to these things, so I am not sure how to properly do it, but I wouldn’t mind trying it out myself

guillaumeeb · January 31, 2025, 3:57pm

No problem, do not hesitate to try and no hurry. You can ping me if you want. When opening the github issue tracker at GitHub · Where software is built, you’ll be able to create a new issue of Bug Report type, and you’ll see some template. Try to fill all the part if you can, but the reproducer above is the most important point!

Topic		Replies	Views
Actual and meta columns mismatch Dask DataFrame	2	1282	July 22, 2022
Index name changed after groupby() and apply() and missing column Dask DataFrame	0	1654	November 21, 2022
DataFrame created by DataFrame.apply() Dask DataFrame	1	2216	April 27, 2022
Df[cols].drop_duplicates().compute() causes ValueError: The columns in the computed data do not match the columns in the provided metadata Dask DataFrame	1	36	February 14, 2025
Meta='int' failed Dask DataFrame	1	220	January 15, 2022

Inconsistencies with Dask Columns & Indices

Related topics