Order of .compute() and .groupby() impacting results

ChrisKInfo · March 14, 2025, 12:04pm

Dear All

I have recently started using Dask and largely everything is working as expected. But for one calculation I got results that I did not understand, so I called compute() first, calculated the same statistic on the computed dataframe, and got different results. At first I thought these were numerical rounding errors, but testing with integers and large values confirmed the discrepancy.

Below, I uploaded an MWE. My question is: why does an operation, e.g. groupby().std(), give a different result depending on whether it comes before or after .compute()?

I thought it was good practice to not call compute() too early, so I tried using groupby first. But calling compute() first gave the result I expected, so now I am trying to understand the difference.

rand_arr = np.random.randint(0,100,100_000)
repeat_arr = np.repeat(np.arange(2), 50000)
randddf = dd.from_pandas(pd.DataFrame({"rand": rand_arr, "repeat": repeat_arr}), npartitions=100)
std_rand = randddf.groupby(by=['repeat']).std(ddof=0).compute()
std_alt = randddf.compute().groupby(by=['repeat']).std(ddof=0)
print(std_rand, std_alt)

In this MWE the results differ by almost a factor of 2 in my case, so this might be just me misunderstanding behavior.
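As a sanity check on which of the two results to trust, the expected value can be worked out without pandas or Dask: for integers drawn uniformly from 0..99, the population standard deviation (ddof=0) is sqrt((100**2 - 1) / 12), about 28.87. The sketch below is a minimal NumPy-only reference; the fixed seed and the names `rng`, `expected`, and `group_std` are additions for illustration and are not part of the MWE.

```python
import numpy as np

# Fixed seed so the check is repeatable (an addition; the MWE is unseeded).
rng = np.random.default_rng(0)
rand_arr = rng.integers(0, 100, 100_000)
repeat_arr = np.repeat(np.arange(2), 50_000)

# Population std of a discrete uniform distribution on 0..99.
expected = np.sqrt((100**2 - 1) / 12)
print("theoretical std:", round(float(expected), 3))

# Per-group population std computed directly on the NumPy arrays.
group_std = {}
for key in np.unique(repeat_arr):
    group_std[int(key)] = float(rand_arr[repeat_arr == key].std(ddof=0))
    print(key, group_std[int(key)])
```

Whichever of std_rand and std_alt lands close to 28.87 for both groups is the one that matches the data.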

Thank you for your help

guillaumeeb · March 14, 2025, 1:10pm

Hi @ChrisKInfo, welcome to the Dask Discourse forum!

I just tried your reproducer, and I’m obtaining the same results with both approaches.

Which Dask versions are you using?

ChrisKInfo · March 14, 2025, 5:01pm

Thank you @guillaumeeb for looking into this. The problem occurs, for me, on two separate machines/environments: one with Python 3.8.2 and Dask 2023.5.0, the other with Python 3.11.11 and Dask 2024.12.1.

This is the entire code, including imports (Cursor AI or VSC, both on Win 10/11 machines)

import numpy as np
import pandas as pd
import dask
print("Dask version:", dask.__version__)
import dask.dataframe as dd
dask.config.set(scheduler='threads') 
dask.config.set(temporary_directory=None)
dask.config.set(memory_limit='32GB')

# %%
rand_arr = np.random.randint(0,100,100_000)
repeat_arr = np.repeat(np.arange(2), 50000)
randddf = dd.from_pandas(pd.DataFrame({"rand": rand_arr, "repeat": repeat_arr}), npartitions=100)
std_rand = randddf.groupby(by=['repeat']).std(ddof=0).compute()
std_alt = randddf.compute().groupby(by=['repeat']).std(ddof=0)
print(std_rand, std_alt)

guillaumeeb · March 14, 2025, 6:01pm

Just tried this code on a Linux RH8 environment, working fine…

Dask version: 2024.2.1
repeat           
0       28.940246
1       28.888514              rand
repeat           
0       28.940246
1       28.888514

Not sure why it fails on Windows.
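One Windows/Linux difference that may be worth ruling out, offered as a hypothesis and not confirmed anywhere in this thread: with NumPy versions before 2.0, np.random.randint returns int32 on Windows and int64 on Linux. If the partitioned std is assembled from per-partition sums and sums of squares (an assumption about the implementation, not verified here), squaring a group total of roughly 2.5 million would not fit in 32 bits. A minimal sketch that prints the dtype and shows the overflow with NumPy alone; the seed and the names `total`, `sq32`, and `sq64` are illustrative additions:

```python
import numpy as np

np.random.seed(0)  # fixed seed for repeatability (an addition to the reproducer)
rand_arr = np.random.randint(0, 100, 100_000)

# With NumPy < 2.0 this is expected to be int32 on Windows and int64 on Linux.
print("default randint dtype:", rand_arr.dtype)

# Casting explicitly gives the same integer width on every platform.
rand_arr64 = rand_arr.astype(np.int64)

# Sum of one group of 50,000 values, taken as a plain Python int.
total = int(rand_arr64[:50_000].sum())

# Squaring that total in 32-bit arithmetic wraps around silently;
# in 64-bit arithmetic it is exact.
sq32 = np.array([total], dtype=np.int32) ** 2
sq64 = np.array([total], dtype=np.int64) ** 2
print("int32 square:", sq32[0])
print("int64 square:", sq64[0])
```

If the dtype turns out to be int32, rerunning the reproducer with rand_arr.astype(np.int64) would show whether the discrepancy disappears.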
