List of Dask Dataframe operations that could be run in parallel without using map_partitions

slepeturin · December 2, 2024, 10:56pm

Where can I find a list of Dask DataFrame operations that run in parallel without using map_partitions?

guillaumeeb · December 4, 2024, 4:03pm

Do you mean Dask DataFrame API with Logical Query Planning — Dask documentation?

slepeturin · December 4, 2024, 8:51pm

Thanks Guillaume for your quick reply!

I am asking about a list of the Dask operations that are guaranteed to run in parallel.
For example, do both statements in each pair below run in parallel and have the same performance?

dd_dask_df = dd_dask_df_in.groupby(['id', 'lvl']).agg({'fdate': 'max'}).reset_index()
dd_dask_df = dd_dask_df_in.map_partitions(lambda df: df.groupby([df.index, 'id', 'lvl']).agg(fdate_max=('fdate', 'max')).reset_index(level=['id', 'lvl']), meta=this_meta)

dd_dask_df_A = dd_dask_df_A.map_partitions(lambda df: df[df['col1'].isnull()])
dd_dask_df_A = dd_dask_df_A[dd_dask_df_A['col1'].isnull()]

For my logic, I need all rows with the same index value to be in one partition.
To achieve this, the logic sets the index with predefined divisions.
But, say, after a left merge the divisions are lost, so I would need to reset the index, for example:

    (
        ddf1.merge(ddf2, how='left', left_index=True, right_index=True, …)
            .reset_index(drop=True)
            .set_index('id', drop=False, divisions=sorted(the_divisions))
            .rename_axis(index='idx')
    )

How can I force Dask to keep the original divisions?

Is there a list of Dask operations that keep the original divisions?
For example, would the original divisions be kept after either of these:

    dd_dask_df_A = dd_dask_df_A.map_partitions(lambda df: df[df['col1'].isnull()])
    dd_dask_df_A = dd_dask_df_A[dd_dask_df_A['col1'].isnull()]

If we need only a subset of a Dask DataFrame's columns, are the two statements below the same performance-wise?

    df1 = df2[['col1', 'col2']]
    df1 = df2.loc[:, ['col1', 'col2']]

guillaumeeb · December 6, 2024, 6:20pm

Well, in general, all operations on a Dask collection run in parallel. However, some operations, like groupby or merge, are complex to implement in a distributed context: they often require MapReduce-style operations with a full shuffle of data across workers and partitions.

slepeturin:

dd_dask_df = dd_dask_df_in.groupby(['id', 'lvl']).agg({'fdate': 'max'}).reset_index()
dd_dask_df = dd_dask_df_in.map_partitions(lambda df: df.groupby([df.index, 'id', 'lvl']).agg(fdate_max=('fdate', 'max')).reset_index(level=['id', 'lvl']), meta=this_meta)

Those two operations will not give the same result: the first is a full groupby over the whole dataset, while the second is only a groupby within each partition. But yes, both will run in parallel.

The two filtering statements (with and without map_partitions) are equivalent.

Well, if you already have correct divisions in ddf1, then the easiest approach would probably be to apply the same divisions to ddf2 before joining. This way the divisions will already be aligned, and they will be kept.

As long as you don’t perform operations that repartition your data, such as joins, the partitions and divisions will always be kept.

As for the two ways of selecting columns: I guess so, but I have to admit I don’t know.

slepeturin · December 6, 2024, 7:23pm

Thanks so much, Guillaume, for your quick response! It clarifies things for me.
