Doubts related Dask dataframe

vigneshn1997 · January 29, 2022, 5:22am

I had 2 doubts regarding dask dataframe API

Is there any difference between dataframe apply and map_partitions functions in terms of performance?
df.groupby(‘reviewerID’)[‘overall’].count(split_out=4, split_every=False) : what do the split_out and split_every arguments signify? I could not get the details about the arguments from the documentation.

pavithraes · January 31, 2022, 6:18pm

@vigneshn1997 Those are both good questions!

apply calls map_partitions internally, so they should both give you similar performance.

As for split_out and split_every:

For some context, Dask’s task graphs have a reduction step for groupby aggregations.
split_out lets you set the number of partitions your output should have, and
split_every helps Dask to compute how many layers there are in the tree (more concretely, each parent node, will have at most the number of children defined by split_every).

Note that you wouldn’t need to tune these manually in most cases.

Here is an example that shows the task graphs generated by using these:

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1,2,1,2,1,2,1,2,1,2], 'y': range(10)})
ddf = dd.from_pandas(df, npartitions=5)

a = ddf.groupby('x').y.sum()
a.visualize()

b = ddf.groupby('x').y.sum(split_out=2)
b.visualize()

c = ddf.groupby('x').y.sum(split_every=3)
c.visualize()

vigneshn1997 · January 31, 2022, 11:16pm

Thank you @pavithraes for the detailed answer. This is really helpful for me.

scharlottej13 · February 14, 2022, 6:00pm

A post was split to a new topic: Map_partitions question for image processing

Topic		Replies	Views
Map_partitions question for image processing Dask DataFrame	6	838	February 21, 2022
How groubyied Dataframe works? Dask DataFrame distributed	6	283	February 17, 2023
Using all cores on a large VM (112 cores/1TB ram) when grouping by on a dask dataframe Dask DataFrame	3	236	August 9, 2023
Row processing: map_partitions vs apply	1	322	March 28, 2022
Operations on a partitioned DataFrame not actually distributed across workers Dask DataFrame distributed	4	325	May 13, 2022

Doubts related Dask dataframe

Related topics