Processing array subregions

mdrio · March 31, 2023, 4:13pm

Hi, I am using Dask for processing Whole Slide Images. I have a large, (H,W, 3) shaped array and a binary mask with shape (H, W). Both the array and the mask are dask arrays. I need to apply a function to the array subregions where the mask is True. The result shape can be (H,W, 3) or (H,W).
As I understand, function map_blocks can be a candidate, but I would like to avoid reading the whole input array. I also checked piecewise, but the output must have the same shape of the input, so it is not suitable.
It is also not possible to index the array with the mask since they don’t have the same shape.

Any suggestion is welcome, thanks!

guillaumeeb · March 31, 2023, 7:29pm

Hi @mdrio,

Could you precise things here, you’d like to read from disk only the values of the input where the binary mask is true?

Also, this would be a lot easier for us with a minimum reproducible example.

mdrio · April 3, 2023, 2:10pm

Yes, basically I read slides from file system using https://github.com/manzt/napari-lazy-openslide, which creates large dask arrays. I need to process subregions, possibly without reading the whole data. Subregions are identified by 2D binary masks, which indicates the pixels coordinates.

Here is an example that seems to work if the processing produces a 2D array (tested on dask 2022.7.1):

import dask.array as da

#image is actually read from the FS, here for simplicity's sake small, random data are used
shape = (10, 10, 3)
image = da.random.random_integers(0, 255, shape, chunks=(2, 2, 3))


#bidimensional mask used for filtering the image before processing
mask = da.zeros(shape[:2], dtype="bool")


#let's put some True values
mask[:5, :] = True


#then indices are computed, assuming they fit in memory
idx = [_.compute() for _ in da.where(mask)]


#filtering the image produces a 2D array
filtered_image = image.vindex[idx[0], idx[1]]


#image is processed using a dummy function that returns the red channel.
processed_image = filtered_image.map_blocks(lambda array: array[:,0], drop_axis=[1])


#let's get a 2D image with the red pixel where mask is True, zeros anywhere else
zeros = da.zeros(shape[:2])
result = da.piecewise(zeros, [mask], [processed_image]).compute()

Now:

Is it ok or there is some other trivial and more performant solution?
What can be a solution in case the function used in the map_blocks returns a different shaped array, for instance something like lambda array: array//2?
In another words, I can’t find a good general solution to such a pipeline:

3D array -> filter by 2D mask -> apply generic function -> get 2D or 3D result

Moreover, what if the mask indices are too large so they can’t fit in memory? In that case vindex cannot be used.

Thanks in advance

guillaumeeb · April 3, 2023, 5:31pm

Thanks for the example!

I have to admit I’m a bit puzzled. I don’t know the detail, and maybe I’m wrong, but while running it line by line in a notebook, I felt a bit surprised by the results:

filtered_image is of shape (50, 3), so looks like you stacked one dimension using vindex.
result is of shape (10, 10), is this expected?

From what I understood, I came up with this code, using Dask MaskedArray:

import dask.array as da

shape = (10, 10, 3)
image = da.random.random_integers(0, 255, shape, chunks=(2, 2, 3))

mask = da.zeros(shape[:2], dtype="bool")
mask[:5, :] = True

mask3d = da.stack([mask, mask, mask], axis=2) # little trick to have a Mask of same shape as input image

masked_image = da.ma.masked_array(image, mask3d)

processed_image = masked_image.map_blocks(lambda array: array[:,0], drop_axis=[1])

Here, masked_image looks like that:

And processed_image:

IO won’t be completely optimized with this solution, as you’ll have to read all the input data and the mask at some point. However, it is entirely parallelized, so I guess it should be OK.
You can generalized with 2D or 3D outputs, the following code works just fine:

processed_image = masked_image.map_blocks(lambda array: array // 2)

mask3D is a Dask Array, so it should work even if it doesn’t fit in memory!

Let me know if what I propose here makes sense to you!

mdrio · April 4, 2023, 8:52am

Yes, the result can be (10, 10, 3) in the case 3D → 3D transformation, or (10, 10) for 3D → 2D transformation.

Yes, it does make sense. I also was not considering the masked arrays, so thanks. Probably I will try a flexible solution where the vindex approach or even numpy is used when the data to process are very small compared to the image array, otherwise I will go with your approach.

Thanks again for the feedback!

guillaumeeb · April 4, 2023, 11:00am

Oh yeah, you’re right, my mistake. I was only surprised by the intermediate values.

Just one more thing, in this small example, when you used the vindex function, you lost chunking:

I’m not sure of the behavior on bigger images, but without chunking, your code won’t be parallelized, and you might run into memory problems.

mdrio · April 4, 2023, 12:01pm

ok, I didn’t realize chunk size was changed.

mdrio · April 7, 2023, 6:13pm

Not sure if I should open a new thread, anyway investigating about performance when reading from file, I noticed a behavior I did not expected regarding the function where. Suppose one has a zarr array:

import dask.array as da
import zarr

shape = (1000, 1000)
chunks = (100, 100)
z1 = zarr.open('test.zarr', mode='w', shape=shape,
               chunks=chunks, dtype='int')

a1 = da.from_zarr(z1)

Suppose using where on a full zeros condition and on a full ones condition.

da.where(m1, da.zeros(shape, chunks=chunks), a1,  0).compute()

#or

da.where(da.ones(shape, chunks=chunks), a1,  0).compute()

In the first case, with the zeros condition, I expected Dask to not read the zarr array at all, instead I generated a report in both case and the disk-read time is more or less the same.

Am I missing something about how where works? Does it exist another similar function that read the array when the condition is True?

guillaumeeb · April 10, 2023, 5:08pm

Well, I don’t think Dask is clever enough to read only the data it needs with the where function. As I understand it, where is only some conditional applied to return values from an array or another one. I expect it would be very difficult to make Dask aware of the workflow before where call in order to prevent it to read just some parts of an array.

mdrio · April 11, 2023, 8:46am

Ok I understand, thanks.

Topic		Replies	Views
Best way to process a large array given a smaller binary mask Dask Array	0	203	October 26, 2022
Using da.delayed for Zarr processing: memory overhead & how to do it better? Dask Array dask-array , delayed	15	1835	January 16, 2024
Indexing a dask array with a boolean array Dask Array dask-array	2	1725	May 19, 2022
Parallelize or map chunks of arrays with different sizes, shapes and number of blocks Dask Array dask-array	4	626	July 31, 2023
Change array shape with map_block function Dask Array	1	139	November 16, 2023

Processing array subregions

Related topics