Order of .compute() and .groupby() impacting results

Dear All

I recently started using Dask, and almost everything works as expected. For one calculation, however, I got results I did not understand, so I called compute() first, calculated the same statistic on the resulting pandas DataFrame, and got different results. At first I thought these were numerical rounding errors, but tests with integers and large values showed the same discrepancy.

Below is an MWE. My question is: why does an operation such as groupby().std() give a different result depending on whether it comes before or after .compute()?

I thought it was good practice not to call compute() too early, so I used groupby first. However, calling compute() first gave the result I expected, so now I am trying to understand the difference.

rand_arr = np.random.randint(0,100,100_000)
repeat_arr = np.repeat(np.arange(2), 50000)
randddf = dd.from_pandas(pd.DataFrame({"rand": rand_arr, "repeat": repeat_arr}), npartitions=100)
std_rand = randddf.groupby(by=['repeat']).std(ddof=0).compute()
std_alt = randddf.compute().groupby(by=['repeat']).std(ddof=0)
print(std_rand, std_alt)

In this MWE the results differ by almost a factor of 2 in my case, so I may just be misunderstanding the behavior.

Thank you for your help

Hi @ChrisKInfo, welcome to the Dask Discourse forum!

I just tried your reproducer, and I'm obtaining the same results with both approaches.

Which Dask versions are you using?
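
For reference, a small sketch that gathers the versions relevant to a report like this. The helper name report_versions and the package list are only illustrative assumptions; it uses nothing beyond the standard library, so it also runs where one of the packages is missing.

```python
import platform
import sys
from importlib import metadata


def report_versions(packages=("dask", "pandas", "numpy")):
    """Return one line per component: the interpreter/OS, then each package."""
    lines = [
        f"Python {sys.version.split()[0]} on {platform.system()} {platform.release()}"
    ]
    for name in packages:
        try:
            lines.append(f"{name} {metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Report the gap instead of failing, so the output stays complete.
            lines.append(f"{name} not installed")
    return lines


for line in report_versions():
    print(line)
```

Including the operating system and the NumPy and pandas versions alongside the Dask version makes it easier to compare environments that behave differently.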

Thank you @guillaumeeb for looking into this. The problem occurs for me on two separate machines/environments: one with Python 3.8.2 and Dask 2023.5.0, the other with Python 3.11.11 and Dask 2024.12.1.

This is the entire code, including imports (run in Cursor AI or VS Code, both on Windows 10/11 machines):

import numpy as np
import pandas as pd
import dask
print("Dask version:", dask.__version__)
import dask.dataframe as dd
dask.config.set(scheduler='threads') 
dask.config.set(temporary_directory=None)
dask.config.set(memory_limit='32GB')

# %%
rand_arr = np.random.randint(0,100,100_000)
repeat_arr = np.repeat(np.arange(2), 50000)
randddf = dd.from_pandas(pd.DataFrame({"rand": rand_arr, "repeat": repeat_arr}), npartitions=100)
std_rand = randddf.groupby(by=['repeat']).std(ddof=0).compute()
std_alt = randddf.compute().groupby(by=['repeat']).std(ddof=0)
print(std_rand, std_alt)

I just tried this code in a Linux RH8 environment, and it works fine:

Dask version: 2024.2.1
repeat           
0       28.940246
1       28.888514              rand
repeat           
0       28.940246
1       28.888514

Not sure why it fails on Windows.