Dear All
I have recently started using dask, and mostly everything works as expected. For one calculation, however, I got results that I did not understand, so I first materialized the data with compute(), calculated the same statistic on the resulting pandas dataframe, and got a different result. At first I thought these were numerical rounding errors, but testing with integers and large values confirmed that the discrepancy is real.
Below is an MWE. My question is: why does an operation such as groupby().std() give a different result depending on whether it comes before or after .compute()?
I thought it was good practice not to call compute() too early, so I tried using groupby first. But calling compute() first gave the result I expected, so now I am trying to understand the difference.
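import numpy as np
import pandas as pd
import dask.dataframe as dd

rand_arr = np.random.randint(0, 100, 100_000)
repeat_arr = np.repeat(np.arange(2), 50_000)
randddf = dd.from_pandas(pd.DataFrame({"rand": rand_arr, "repeat": repeat_arr}), npartitions=100)
# groupby on the dask dataframe first, then compute
std_rand = randddf.groupby(by=['repeat']).std(ddof=0).compute()
# compute first, then groupby on the resulting pandas dataframe
std_alt = randddf.compute().groupby(by=['repeat']).std(ddof=0)
print(std_rand, std_alt)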
In this MWE the two results differ by almost a factor of 2 in my case, so I may simply be misunderstanding the behavior.
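To make my expectation concrete, here is a minimal standard-library sketch of why I assumed partitioning alone should not change a population standard deviation (ddof=0). It combines per-chunk counts, sums, and sums of squares and compares the outcome with a direct calculation over all values. The helper name pstd_from_partitions and the chunk size of 1000 are my own choices for illustration, and I am not claiming this is how dask combines partitions internally.

```python
import math
import random
import statistics


def pstd_from_partitions(partitions):
    """Population std (ddof=0) combined from per-partition count, sum, and sum of squares."""
    n = sum(len(p) for p in partitions)                   # total row count
    s = sum(sum(p) for p in partitions)                   # total sum of values
    ss = sum(sum(x * x for x in p) for p in partitions)   # total sum of squares
    mean = s / n
    return math.sqrt(ss / n - mean * mean)


random.seed(0)
# One group's worth of data: 50,000 integers drawn from 0..99
values = [random.randrange(100) for _ in range(50_000)]
# Split into 50 chunks of 1000 rows each (hypothetical partition layout)
chunks = [values[i:i + 1000] for i in range(0, len(values), 1000)]

direct = statistics.pstdev(values)       # reference: std over all values at once
combined = pstd_from_partitions(chunks)  # same statistic built from chunk summaries
print(direct, combined)
```

In this sketch both numbers agree up to floating-point error, and for uniform integers in 0..99 they should be close to sqrt((100**2 - 1) / 12), roughly 28.87. That is the value I would expect for each group in the MWE regardless of the number of partitions.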
Thank you for your help