`var` and `std` with ddof in groupby context with other aggregations

FBruzzesi · December 23, 2024, 9:37am

Hey there!

Suppose I want to compute the variance or standard deviation with a non-default ddof in a groupby context. I can do:

df.groupby("a")["b"].var(ddof=2)

However, if I want that to happen together with other aggregations such as:

df.groupby("a").agg(b_var = ("b", "var"), c_sum = ("c", "sum"))

my understanding is that, to use a non-default ddof, I need to create a custom aggregation.

Here is what I have so far:

import dask.dataframe as dd

def var(ddof: int = 1) -> dd.Aggregation:
    return dd.Aggregation(
        name="var",
        chunk=lambda s: (s.count(), s.sum(), (s.pow(2)).sum()),
        agg=lambda count, sum_, sum_sq: (count.sum(), sum_.sum(), sum_sq.sum()),
        finalize=lambda count, sum_, sum_sq: (sum_sq - (sum_ ** 2 / count)) / (count - ddof),
    )

Yet, I encounter a RuntimeError:

df.groupby("a").agg({"b": var(2)})

RuntimeError('Failed to generate metadata for DecomposableGroupbyAggregation(frame=df, arg={'b': <dask.dataframe.groupby.Aggregation object at 0x7fdfb8469910>}

What am I missing? Is there a better way to achieve this?
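For reference, the finalize step above relies on the standard one-pass identity for the sum of squared deviations. The notation below (n for the group count, x_i for the values of the group) is introduced here for illustration only and does not appear in the code:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})^2
  = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2,
\qquad
s^2_{\mathrm{ddof}}
  = \frac{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\bigl(\sum_{i=1}^{n} x_i\bigr)^2}{n - \mathrm{ddof}}
```

Only the count, the sum, and the sum of squares are needed per group, which is why these three quantities can be computed per partition and added together afterwards. Note that this form can lose precision through cancellation when the mean is large relative to the spread.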

Full script:

import dask.dataframe as dd

data = {
    "a": [1, 1, 1, 1, 2, 2, 2],
    "b": range(7),
    "c": range(10, 3, -1),
}

df = dd.from_dict(data, 2)

def var(ddof: int = 1) -> dd.Aggregation:
    return dd.Aggregation(
        name="var",
        chunk=lambda s: (s.count(), s.sum(), (s.pow(2)).sum()),
        agg=lambda count, sum_, sum_sq: (count.sum(), sum_.sum(), sum_sq.sum()),
        finalize=lambda count, sum_, sum_sq: (sum_sq - (sum_ ** 2 / count)) / (count - ddof),
    )

df.groupby("a").agg(b_var = ("b", "var"), c_sum = ("c", "sum"))  # <- no issue

df.groupby("a").agg(b_var = ("b", var(2)), c_sum = ("c", "sum"))  # <- RuntimeError
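To make the intended chunk / agg / finalize decomposition concrete, here is a minimal plain-Python sketch of the same arithmetic, without dask. The helper names (`chunk`, `combine`, `finalize`) and the split of the data into two partitions are assumptions made for illustration; this is not how dask calls these functions internally, only the math the custom aggregation is trying to express.

```python
from statistics import variance


def chunk(values):
    """Per-partition partial statistics: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(partials):
    """Add the partial statistics of all partitions component-wise."""
    counts, sums, sums_sq = zip(*partials)
    return (sum(counts), sum(sums), sum(sums_sq))


def finalize(count, sum_, sum_sq, ddof=1):
    """Turn the combined statistics into a variance with the given ddof."""
    return (sum_sq - sum_ ** 2 / count) / (count - ddof)


# Values of column "b" for the group a == 1 in the script above,
# split across two hypothetical partitions.
partitions = [[0, 1], [2, 3]]

count, sum_, sum_sq = combine([chunk(p) for p in partitions])
var_ddof_1 = finalize(count, sum_, sum_sq, ddof=1)
var_ddof_2 = finalize(count, sum_, sum_sq, ddof=2)

# With ddof=1 the result should agree with the usual sample variance.
reference = variance([0, 1, 2, 3])
```

The point of the sketch is that `ddof` only enters in the last step, so the per-partition work is identical whatever value is chosen.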

guillaumeeb · December 29, 2024, 2:10pm

Hi @FBruzzesi,

Your problem probably comes from what is explained in this topic.

One simpler solution would be to use functools.partial:

import functools
var_ddof_2 = functools.partial(dd.groupby.DataFrameGroupBy.var, ddof=2)
df.groupby("a").agg(b_var = ("b", var_ddof_2), c_sum = ("c", "sum"))
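As a small illustration of what `functools.partial` does here, the sketch below binds `ddof=2` to a plain-Python variance function. The `var` helper is hypothetical and stands in for the dask groupby method; only the `functools.partial` behaviour is the point.

```python
import functools


def var(values, ddof=1):
    """Plain-Python variance with a configurable ddof (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


# partial() returns a new callable with ddof fixed to 2; the remaining
# positional argument is supplied when the callable is invoked.
var_ddof_2 = functools.partial(var, ddof=2)

result = var_ddof_2([0, 1, 2, 3])
```

The returned object keeps the wrapped function and the bound keyword available as `var_ddof_2.func` and `var_ddof_2.keywords`, so whatever receives it can still call it like a one-argument function.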

FBruzzesi · December 29, 2024, 7:34pm

Thanks a ton @guillaumeeb. I was probably overcomplicating things; your solution seems much easier! Just to confirm, is there any loss of optimization with this approach?

I just opened a PR in narwhals, which is where the issue originated.

I am currently getting an error with the nightly dask build, as dd.groupby.DataFrameGroupBy is not found/available there. Should I use dask_expr._groupby.GroupBy from the next release onward?

guillaumeeb · January 3, 2025, 2:32pm

I can’t see where there could be any loss of optimization, but I might be mistaken.

You are right about the nightly build. I have to admit I don’t know what the right import would be now…

FBruzzesi · January 3, 2025, 9:43pm

guillaumeeb wrote: "I can’t see where there could be, but I might be mistaken."

Amazing! Thank you!

guillaumeeb wrote: "You are right. I have to admit I don’t know what would be the right import to use now…"

For my use case, the minimum dask_expr version installed with dask[dataframe] already provides dask_expr._groupby.GroupBy, so I switched to using that class.

Thanks and kudos for your help!
