Reduce topologies for general data structures

I see a tree-reduce mechanism implemented for Dask Arrays. I currently operate on Dask Bags, and I was curious whether there are existing Dask reduce implementations which allow for custom reduce operations?


NOT A CONTRIBUTION

I guess this is one implementation: dask.bag.Bag.fold — Dask documentation
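For reference, here’s a minimal sketch of how I’d use it, assuming a toy integer Bag (the data and the split_every value are just illustrative, not from my actual workload):

```python
import dask.bag as db

# Toy data: 1000 integers spread over 8 partitions.
b = db.from_sequence(range(1000), npartitions=8)

# binop folds elements within a partition; combine merges partial results.
# split_every caps how many partial results are combined per tree level,
# which is what shapes the reduction topology.
total = b.fold(
    binop=lambda acc, x: acc + x,
    combine=lambda left, right: left + right,
    initial=0,
    split_every=4,
)
print(total.compute())  # 499500
```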

I was also curious whether there are existing benchmarks for different kinds of reduction topologies with Dask data structures.


NOT A CONTRIBUTION

So as you’ve seen, there are map/reduce-like operations implemented in all the high-level collections that Dask offers.

I don’t think there is any such benchmark, but since Dask Bag is a collection built over generic Python objects, it is generally considered less optimized than Array or DataFrame, which rely on NumPy and pandas, C-optimized libraries under the hood.

See Bag — Dask documentation

These shuffle operations are expensive and better handled by projects like dask.dataframe. It is best to use dask.bag to clean and process data, then transform it into an array or DataFrame before embarking on the more complex operations that require shuffle steps.
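To illustrate the difference, here is a minimal sketch of the same kind of tree reduction expressed on a Dask Array, where split_every again controls the fan-in (the sizes here are just illustrative):

```python
import dask.array as da

# The same sum as a tree reduction on a Dask Array: 8 NumPy-backed
# chunks instead of partitions of generic Python objects.
x = da.arange(1000, chunks=125)

# split_every=4 means at most 4 partial sums are merged per tree level.
total = x.sum(split_every=4)
print(total.compute())  # 499500
```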

In this case I am mostly manipulating GPU-hosted torch.Tensor objects. They exist in this form because most of the transformations use torch’s CUDA kernels, which are not available in cuDF or the other RAPIDS libraries. However, for the reduce operation in question I can use cuDF operations.

Would you know if the overhead of transforming dask.Bag[torch.Tensor] → dask.Array[float], only for the reduce operation, is worth the speed gains? My experience performing torch.Tensor mini-batch processing in map_partitions, paying the conversion cost from DataFrame → Tensor for every user function, makes me believe that the conversion overhead is high.
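In case it helps frame the question, here is a hypothetical micro-benchmark sketch of what I mean, with illustrative sizes and CPU tensors standing in for the GPU-hosted ones (all names and numbers are assumptions):

```python
import time
import torch
import dask.bag as db

# Illustrative stand-in data: CPU tensors instead of GPU-hosted ones.
tensors = [torch.randn(1_000_000) for _ in range(8)]
bag = db.from_sequence(tensors, npartitions=8)

# Cost of the Bag[torch.Tensor] -> NumPy conversion alone, which any
# Bag -> dask.Array path has to pay before the reduce.
start = time.perf_counter()
bag.map(lambda t: t.cpu().numpy()).compute(scheduler="synchronous")
print("conversion:", time.perf_counter() - start)

# Cost of doing the reduce directly on the tensors, skipping conversion.
start = time.perf_counter()
bag.fold(
    binop=lambda acc, t: acc + t.sum(),
    combine=lambda left, right: left + right,
    initial=torch.tensor(0.0),
).compute(scheduler="synchronous")
print("fold in torch:", time.perf_counter() - start)
```

The synchronous scheduler is used here so the timings aren’t dominated by serializing tensors between worker processes.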


NOT A CONTRIBUTION

Sorry, I can’t tell at all.