Reduce topologies for general data structures

I see a tree-reduce mechanism implemented for Dask Arrays. I currently operate on Dask Bags, and I was curious whether there are existing Dask reduce implementations which allow for custom reduce operations?


NOT A CONTRIBUTION

I guess this is one implementation: dask.bag.Bag.fold — Dask documentation
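For reference, here’s a minimal sketch of how I’d use it, assuming a toy integer Bag (the data and the split_every value are just illustrative, not from my actual workload):

```python
import dask.bag as db

# Toy data: 1000 integers spread over 8 partitions.
b = db.from_sequence(range(1000), npartitions=8)

# binop folds elements within a partition; combine merges partial results.
# split_every caps how many partial results are combined per tree level,
# which is what shapes the reduction topology.
total = b.fold(
    binop=lambda acc, x: acc + x,
    combine=lambda left, right: left + right,
    initial=0,
    split_every=4,
)
print(total.compute())  # 499500
```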

I was also curious whether there are existing benchmarks for different kinds of reduction topologies with Dask data structures.


NOT A CONTRIBUTION

So as you’ve seen, there are map/reduce-like operations implemented in all the high-level collections that Dask offers.

I don’t think there is any such benchmark, but since Dask Bag is a collection built over generic Python objects, it is generally considered less optimized than Array or DataFrame, which rely on NumPy and pandas, C-optimized libraries under the hood.

See Bag — Dask documentation

These shuffle operations are expensive and better handled by projects like dask.dataframe. It is best to use dask.bag to clean and process data, then transform it into an array or DataFrame before embarking on the more complex operations that require shuffle steps.
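To illustrate the difference, here is a minimal sketch of the same kind of tree reduction expressed on a Dask Array, where split_every again controls the fan-in (the sizes here are just illustrative):

```python
import dask.array as da

# The same sum as a tree reduction on a Dask Array: 8 NumPy-backed
# chunks instead of partitions of generic Python objects.
x = da.arange(1000, chunks=125)

# split_every=4 means at most 4 partial sums are merged per tree level.
total = x.sum(split_every=4)
print(total.compute())  # 499500
```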

In this case I am mostly manipulating GPU-hosted torch.Tensor objects. They exist in this form because most of the transformations use torch’s CUDA kernels, which are not available in cuDF or the other RAPIDS libraries. However, for the reduce operation in question I can use cuDF operations.

Would you know if the overhead of transforming dask.Bag[torch.Tensor] → dask.Array[float], only for the reduce operation, is worth the speed gains? My experience performing torch.Tensor mini-batch processing in map_partitions, paying the conversion cost from DataFrame → Tensor for every user function, makes me believe that the conversion overhead is high.
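In case it helps frame the question, here is a hypothetical micro-benchmark sketch of what I mean, with illustrative sizes and CPU tensors standing in for the GPU-hosted ones (all names and numbers are assumptions):

```python
import time
import torch
import dask.bag as db

# Illustrative stand-in data: CPU tensors instead of GPU-hosted ones.
tensors = [torch.randn(1_000_000) for _ in range(8)]
bag = db.from_sequence(tensors, npartitions=8)

# Cost of the Bag[torch.Tensor] -> NumPy conversion alone, which any
# Bag -> dask.Array path has to pay before the reduce.
start = time.perf_counter()
bag.map(lambda t: t.cpu().numpy()).compute(scheduler="synchronous")
print("conversion:", time.perf_counter() - start)

# Cost of doing the reduce directly on the tensors, skipping conversion.
start = time.perf_counter()
bag.fold(
    binop=lambda acc, t: acc + t.sum(),
    combine=lambda left, right: left + right,
    initial=torch.tensor(0.0),
).compute(scheduler="synchronous")
print("fold in torch:", time.perf_counter() - start)
```

The synchronous scheduler is used here so the timings aren’t dominated by serializing tensors between worker processes.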


NOT A CONTRIBUTION

Sorry, I can’t tell at all.