Lazy sum_labels using map_blocks with lazy index

First of all this might turn out to be more of an xarray question/issue, so apologies in advance. However it is very much related to dask and the bigger picture issue could be solved with dask.

What am I trying to do?

I am trying to write a function similar to scipy.ndimage.sum_labels, but for dask arrays (I won’t go into much detail on why I’m not using dask-image for this except I experienced some performance problems with it). The caveat is that instead of directly providing an index, I want to use da.unique to identify unique labels. Note that the labels and values arrays are results of other lazy computations.

What do I have so far?

The following is a minimal reproducible example.

import dask.array as da
import numpy as np
import xarray as xr
from scipy.ndimage import sum_labels

coords = {"y": [0, 1], "x": [0, 1, 2]}
chunks = {"y": 2, "x": 1}
values = xr.DataArray([[0.6, 0.1, 0.0], [0.2, 0.1, 0.4]], coords=coords).chunk(chunks)
labels = xr.DataArray([[1, 0, 3], [1, 1, 3]], coords=coords).chunk(chunks)

index = da.unique(labels.data)
# index.compute_chunk_sizes()
# index = index.compute()

def sum_labels_block(value_block : xr.DataArray, label_block: xr.DataArray, index) -> xr.DataArray:
    block_sum = sum_labels(value_block.values, label_block.values, index)
    block_sum = block_sum[np.newaxis, np.newaxis, :]
    # Returns an array of size (1, 1, len(index))
    return xr.DataArray(block_sum, dims=["y", "x", "index"], coords={"index": index})


y_chunks, x_chunks = map(len, values.chunks)
template_shape = (y_chunks, x_chunks, len(index))
template_chunks = (1, 1, -1)
template_values = da.zeros(shape=template_shape, dtype=values.dtype, chunks=template_chunks)
template = xr.DataArray(template_values, dims=["y", "x", "index"], coords={"index": index})

block_sums = values.map_blocks(sum_labels_block, args=(labels, index), template=template)
block_sums = block_sums.chunk(-1)
result = block_sums.sum(dim=["y", "x"])

result = result.compute()
print(result)

What is my main issue?

As it is, I get a ValueError: Cannot call len() on object with unknown chunk size.. I get that, I need to know how much space to allocate for the index dimension before computing. I can solve this in principle with a compute_chunk_sizes call. This leads to my first question:

  1. From a perspective of delaying computation as much as possible, is compute_chunk_sizes actually (much) better than just calling compute? For both I imagine all the preceding computations to arrive at the values and labels arrays must be done.

In any case, assuming compute_chunk_sizes is better than compute, I end up with an index that is a dask array. However, I still get an error: AttributeError: 'numpy.ndarray' object has no attribute '_meta'. Apparently the sum_labels_block receives an index that is also a dask array, which scipy then doesn’t seem to handle well. What I’d like is for the index in the block to be a computed numpy array (the index itself isn’t that large, the point of it being a dask array is mostly about it being lazy). Therefore my second question:

  1. Is there a way to pass a lazy index into the map_blocks func and make dask compute it before the blocks are actually computed, but not compute it eagerly (that is, between the da.unique and the values.map_blocks call). I also don’t want to call compute within sum_labels_block because as I see it it will get computed individually for every block which is inefficient.

And finally, because I worry my questions are already too narrowly focused on a certain solution:

  1. Is there in general a better approach to this problem I’m trying to solve? Preferably one that is completely lazy, i.e. can defer all computations until the final result.compute call. I’m thinking something using dask.delayed, but I’m open for anything.

Versions used for the example:
Python 3.13.5
dask 2025.7.0
xarray 2025.7.1

Hi @martijnvandermarel,

If I understand correctly your problem, then I don’t see how you can avoid computing the unique values of labels to build a unique index. I’ll try to understand your questions though.

Well, in your case, I don’t think so, because da.unique will return a single chunk in the end. This is a complete reduction. The unique gain here is that it will not be copied into a Numpy structure locally, especially if you were using a distributed cluster.
In some other cases, this might be a bit lighter because you don’t have reductions steps. But in the end, you often need to check all the chunks results to determine a shape.

I don’t think so, and anyway, this would be almost like computing the result yourself, saving maybe some network exchange on a distributed setup.

My first thought here would be to try to compute the unique values of the labels at the same time you are computing the sum. Computing partial indexes on each blocks with numpy unique, keeping them for each block, and trying to reconcile it in the end. But that sounds not easy, and maybe not feasible with dask.array or Xarray only.

About this part, did you try the dask_image.ndmeasure.sum_labels method? Did you report your problem? And maybe a look as how it is done in there might help, although it assumes an index as a sequence of int.

Thanks @guillaumeeb you for your answers!

The reason I would like to avoid computing is that this functionality will go into a function as part of a library, meaning it would generally be used as part of a longer chain of operations. As far as I understand it’s preferable to avoid computation halfway and the more I can defer computation to the end the better (so that the dask scheduler has as much freedom as possible to parallelize stuff when compute is called at the very end).

One option I’m considering is the following, where I use a full ‘dummy’ index and select the unique labels only afterwards.

if labels.dtype not in [np.uint8, np.uint16, np.int8, np.int16]:
    raise TypeError("Labels dtype must be np.uint8, np.uint16, np.int8, or np.int16")
dummy_index = np.arange(np.iinfo(dtype).min, np.iinfo(dtype).max + 1, dtype=labels.dtype)
template_shape = (y_chunks, x_chunks, len(dummy_index))
template_chunks = (1, 1, -1)
template_values = da.zeros(shape=template_shape, dtype=values.dtype, chunks=template_chunks)
template = xr.DataArray(template_values, dims=["y", "x", "index"], coords={"index": dummy_index})

block_sums = values.map_blocks(sum_labels_block, args=(labels, dummy_index), template=template)
block_sums = block_sums.chunk(-1)
result = block_sums.sum(dim=["y", "x"])

# select only the relevant indices
result = result.sel(index=unique)

I realize this may be less memory-efficient (and maybe also less computationally-efficient depending on the implementation of scipy’s sum_labels) since it will store a larger index, but I think this would be completely lazy. Of course I could also allow further reducing the dummy index if it is known in advance the label values don’t exceed a certain maximum.

(There is the point that the output of the sel now also has unknown size, meaning if the next operation in the chain requires its length or chunksize a compute is still needed and I won’t have gained anything).

I understand the optimal choice would generally depend on how it’s used. I will play around a bit with some of my common use cases to see what would work best.

Regarding dask_image.ndmeasure.sum_labels: it has been some time since I tried it but from what I remember it used the more generic labeled_comprehension under the hood which I think is a bit more complicated. A simple sum operation is easy to do in parallel on chunks while e.g. a mean operation may require more inter-chunk communication. At least that’s my hypothesis on why I experienced the map_blocks implementation being faster.

Well yes, it’s certainly a very good best practice, even if not always possible.

If you know you won’t have anything other than int16, then that seems like a nice solution!

That makes sense!

Keep us posted!