How does a groupby'd DataFrame work?

Hi, how does partitioning work for groupby in Dask?
Let's say that I have a pretty huge DataFrame, and I want to use statsmodels.

The problem with statsmodels is that it only supports Pandas DataFrames. As I understand it, if I convert a Dask DataFrame to a Pandas DataFrame, it means that from the cluster's perspective all the data from the nodes is gathered into one point (the main node)…

So, my idea is that I will apply the groupby function (or another function) on the Dask DataFrame, and each grouped DataFrame will be converted to a Pandas DataFrame and then used in statsmodels.

The question is, can someone elaborate on how the DataFrame will behave in that case? Or maybe someone has a better idea?

Hi @Richard, welcome here!

True! To be really precise, it is gathered on the node where you launched the computation (where the Client is, if using Distributed).

What you are looking for is this:
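Something like this, as a minimal sketch (the data, column names, and the OLS model below are just made-up placeholders):

import pandas as pd
import dask.dataframe as dd
import statsmodels.api as sm

# Hypothetical data: we want to fit one linear model per group
pdf = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'] * 25,
    'x': range(100),
    'y': range(100),
})
ddf = dd.from_pandas(pdf, npartitions=4)

def fit_group(group_df):
    # group_df is a plain Pandas DataFrame holding one group,
    # so it can be handed to statsmodels directly
    model = sm.OLS(group_df['y'], sm.add_constant(group_df['x'])).fit()
    return model.params['x']  # slope fitted for this group

# meta describes the type of each group's result
slopes = ddf.groupby('key').apply(fit_group, meta=('slope', 'float64'))
print(slopes.compute())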

Feel free to ask if you have more questions or if I did not understand correctly.

Hi @guillaumeeb

Thank you very much for your reply.

From the example:

def group_test(data_ddf):
    pandas_df = data_ddf.compute()  # Where is this executed?
    # other operations

df.groupby('col').apply(group_test, meta=object).compute()

Is the data for pandas_df also collected on the main node?
As I understand it, Dask has row-based partitioning, so if yes (collected on the main node), why is that happening? Why can't the collect happen on the other (worker) nodes instead?

Or is there any comparison between Dask groupby and Spark UDFs?

I mean, I am looking for a similar answer but for Dask :slight_smile: pyspark - Does the User Defined Functions (UDF) in SPARK works in a distributed way? - Stack Overflow

I think your example should look like this:

def group_test(pandas_df):
    # The input to the function is a Pandas DataFrame, because it's
    # only one partition of the Dask DataFrame
    # other operations
    return pandas_df  # placeholder return so the sketch runs

df.groupby('col').apply(group_test, meta=object).compute()

So no, the group_test function will be applied on the Worker nodes; each group_test task will take a Pandas DataFrame as input, this Pandas DataFrame being one partition of the Dask DataFrame.

So it's not collected; each partition is just loaded onto a Worker node for processing.

So yes, this operation will work in a distributed way!
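If you want to see this for yourself, here is a small sketch (it assumes a Distributed cluster; get_worker only works inside a task running on a Worker):

from dask.distributed import Client, get_worker

client = Client()  # local cluster here; point it at your scheduler in production

def which_worker(pandas_df):
    # Runs on a Worker: report the address of the Worker handling this group
    return get_worker().address

print(df.groupby('col').apply(which_worker, meta=('worker', 'object')).compute())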

However, your last line with apply().compute() will gather all the results on the main node. You might prefer to just write the results to a file system or object store from the workers, using the options described here:
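For example, assuming group_test returns a small Pandas DataFrame per group (the column name and output path below are just placeholders):

# meta describes the DataFrame each group_test call returns
results = df.groupby('col').apply(group_test, meta={'slope': 'float64'})
# Written in parallel from the Workers, nothing is gathered on the main node
results.to_parquet('results/')  # also works with e.g. s3:// paths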

Thank you very much :slight_smile:
