Processing array subregions

Hi, I am using Dask for processing Whole Slide Images. I have a large, (H,W, 3) shaped array and a binary mask with shape (H, W). Both the array and the mask are dask arrays. I need to apply a function to the array subregions where the mask is True. The result shape can be (H,W, 3) or (H,W).
As I understand, function map_blocks can be a candidate, but I would like to avoid reading the whole input array. I also checked piecewise, but the output must have the same shape of the input, so it is not suitable.
It is also not possible to index the array with the mask since they don’t have the same shape.

Any suggestion is welcome, thanks!

Hi @mdrio,

Could you precise things here, you’d like to read from disk only the values of the input where the binary mask is true?

Also, this would be a lot easier for us with a minimum reproducible example.

Yes, basically I read slides from file system using https://github.com/manzt/napari-lazy-openslide, which creates large dask arrays. I need to process subregions, possibly without reading the whole data. Subregions are identified by 2D binary masks, which indicates the pixels coordinates.

Here is an example that seems to work if the processing produces a 2D array (tested on dask 2022.7.1):

import dask.array as da

#image is actually read from the FS, here for simplicity's sake small, random data are used
shape = (10, 10, 3)
image = da.random.random_integers(0, 255, shape, chunks=(2, 2, 3))


#bidimensional mask used for filtering the image before processing
mask = da.zeros(shape[:2], dtype="bool")


#let's put some True values
mask[:5, :] = True


#then indices are computed, assuming they fit in memory
idx = [_.compute() for _ in da.where(mask)]


#filtering the image produces a 2D array
filtered_image = image.vindex[idx[0], idx[1]]


#image is processed using a dummy function that returns the red channel.
processed_image = filtered_image.map_blocks(lambda array: array[:,0], drop_axis=[1])


#let's get a 2D image with the red pixel where mask is True, zeros anywhere else
zeros = da.zeros(shape[:2])
result = da.piecewise(zeros, [mask], [processed_image]).compute()

Now:

  • Is it ok or there is some other trivial and more performant solution?

  • What can be a solution in case the function used in the map_blocks returns a different shaped array, for instance something like lambda array: array//2?
    In another words, I can’t find a good general solution to such a pipeline:

3D array -> filter by 2D mask -> apply generic function -> get 2D or 3D result

  • Moreover, what if the mask indices are too large so they can’t fit in memory? In that case vindex cannot be used.

Thanks in advance

Thanks for the example!

I have to admit I’m a bit puzzled. I don’t know the detail, and maybe I’m wrong, but while running it line by line in a notebook, I felt a bit surprised by the results:

  • filtered_image is of shape (50, 3), so looks like you stacked one dimension using vindex.
  • result is of shape (10, 10), is this expected?

From what I understood, I came up with this code, using Dask MaskedArray:

import dask.array as da

shape = (10, 10, 3)
image = da.random.random_integers(0, 255, shape, chunks=(2, 2, 3))

mask = da.zeros(shape[:2], dtype="bool")
mask[:5, :] = True

mask3d = da.stack([mask, mask, mask], axis=2) # little trick to have a Mask of same shape as input image

masked_image = da.ma.masked_array(image, mask3d)

processed_image = masked_image.map_blocks(lambda array: array[:,0], drop_axis=[1])

Here, masked_image looks like that:
image

And processed_image:
image

  • IO won’t be completely optimized with this solution, as you’ll have to read all the input data and the mask at some point. However, it is entirely parallelized, so I guess it should be OK.
  • You can generalized with 2D or 3D outputs, the following code works just fine:
processed_image = masked_image.map_blocks(lambda array: array // 2)
  • mask3D is a Dask Array, so it should work even if it doesn’t fit in memory!

Let me know if what I propose here makes sense to you!

Yes, the result can be (10, 10, 3) in the case 3D → 3D transformation, or (10, 10) for 3D → 2D transformation.

Yes, it does make sense. I also was not considering the masked arrays, so thanks. Probably I will try a flexible solution where the vindex approach or even numpy is used when the data to process are very small compared to the image array, otherwise I will go with your approach.

Thanks again for the feedback!

Oh yeah, you’re right, my mistake. I was only surprised by the intermediate values.

Just one more thing, in this small example, when you used the vindex function, you lost chunking:
image

I’m not sure of the behavior on bigger images, but without chunking, your code won’t be parallelized, and you might run into memory problems.

ok, I didn’t realize chunk size was changed.

Not sure if I should open a new thread, anyway investigating about performance when reading from file, I noticed a behavior I did not expected regarding the function where. Suppose one has a zarr array:

import dask.array as da
import zarr

shape = (1000, 1000)
chunks = (100, 100)
z1 = zarr.open('test.zarr', mode='w', shape=shape,
               chunks=chunks, dtype='int')

a1 = da.from_zarr(z1)

Suppose using where on a full zeros condition and on a full ones condition.

da.where(m1, da.zeros(shape, chunks=chunks), a1,  0).compute()

#or

da.where(da.ones(shape, chunks=chunks), a1,  0).compute()

In the first case, with the zeros condition, I expected Dask to not read the zarr array at all, instead I generated a report in both case and the disk-read time is more or less the same.

Am I missing something about how where works? Does it exist another similar function that read the array when the condition is True?

Well, I don’t think Dask is clever enough to read only the data it needs with the where function. As I understand it, where is only some conditional applied to return values from an array or another one. I expect it would be very difficult to make Dask aware of the workflow before where call in order to prevent it to read just some parts of an array.

Ok I understand, thanks.