Where can I find:
A list of Dask DataFrame operations that run in parallel without using map_partitions?
Thanks Guillaume for your quick reply!
- I am asking about a list of the Dask operations that are guaranteed to run in parallel.
For example, do both versions in each pair below run in parallel and have the same performance?
dd_dask_df = dd_dask_df_in.groupby(['id', 'lvl']).agg({'fdate': 'max'}).reset_index()
dd_dask_df = dd_dask_df_in.map_partitions(lambda ddf: ddf.groupby([ddf.index, 'id', 'lvl']).agg(fdate_max=('fdate', 'max')).reset_index(level=['id', 'lvl']), meta=this_meta)
dd_dask_df_A = dd_dask_df_A.map_partitions(lambda df: df[df['col1'].isnull()])
dd_dask_df_A = dd_dask_df_A[dd_dask_df_A['col1'].isnull()]
- For my logic, I need all rows with the same index to be in one partition.
For this purpose, the logic sets the index with predefined divisions.
But let's say that after a left merge the divisions are lost, so I would need to reset the index.
For example:
(
ddf1.merge(ddf2, how='left', left_index=True, right_index=True, …)
.reset_index(drop=True)
.set_index('id', drop=False, divisions=sorted(the_divisions))
.rename_axis(index='idx')
)
How can I force Dask to keep the original divisions?
Is there a list of Dask operations that keep the original divisions?
For example, would the original divisions be kept after:
dd_dask_df_A = dd_dask_df_A.map_partitions(lambda df: df[df['col1'].isnull()])
dd_dask_df_A = dd_dask_df_A[dd_dask_df_A['col1'].isnull()]
- If we need to use only a subset of a Dask DataFrame's columns, are the two forms below the same performance-wise?
df1 = df2[['col1', 'col2']]
df1 = df2.loc[:, ['col1', 'col2']]
Well, in general, all operations on a Dask collection run in parallel. However, some operations, like groupby or merge, are complex to implement in a distributed context: they often require MapReduce-style operations with a full shuffle of data across workers and partitions.
Those two groupby operations will not give the same result: the first is a full groupby over the whole dataset, while the second is only a groupby within each partition. But yes, both will run in parallel.
The two isnull filters are equivalent.
Well, if you already have correct divisions in ddf1, then the easiest approach would probably be to apply the same divisions to ddf2 before joining. This way the divisions will already be aligned, and they will be kept.
As long as you don't perform an operation that repartitions your data, such as a join, partitions and divisions will always be kept.
As for the two column selections: I guess so, but I have to admit I don’t know.
Thanks so much, Guillaume, for your quick response! It clarifies things for me.