Hi, I am using Dask for processing Whole Slide Images. I have a large, (H,W, 3) shaped array and a binary mask with shape (H, W). Both the array and the mask are dask arrays. I need to apply a function to the array subregions where the mask is True. The result shape can be (H,W, 3) or (H,W).
As I understand, function map_blocks can be a candidate, but I would like to avoid reading the whole input array. I also checked piecewise, but the output must have the same shape of the input, so it is not suitable.
It is also not possible to index the array with the mask since they don’t have the same shape.
Yes, basically I read slides from file system using https://github.com/manzt/napari-lazy-openslide, which creates large dask arrays. I need to process subregions, possibly without reading the whole data. Subregions are identified by 2D binary masks, which indicates the pixels coordinates.
Here is an example that seems to work if the processing produces a 2D array (tested on dask 2022.7.1):
import dask.array as da
#image is actually read from the FS, here for simplicity's sake small, random data are used
shape = (10, 10, 3)
image = da.random.random_integers(0, 255, shape, chunks=(2, 2, 3))
#bidimensional mask used for filtering the image before processing
mask = da.zeros(shape[:2], dtype="bool")
#let's put some True values
mask[:5, :] = True
#then indices are computed, assuming they fit in memory
idx = [_.compute() for _ in da.where(mask)]
#filtering the image produces a 2D array
filtered_image = image.vindex[idx, idx]
#image is processed using a dummy function that returns the red channel.
processed_image = filtered_image.map_blocks(lambda array: array[:,0], drop_axis=)
#let's get a 2D image with the red pixel where mask is True, zeros anywhere else
zeros = da.zeros(shape[:2])
result = da.piecewise(zeros, [mask], [processed_image]).compute()
Is it ok or there is some other trivial and more performant solution?
What can be a solution in case the function used in the map_blocks returns a different shaped array, for instance something like lambda array: array//2?
In another words, I can’t find a good general solution to such a pipeline:
3D array -> filter by 2D mask -> apply generic function -> get 2D or 3D result
Moreover, what if the mask indices are too large so they can’t fit in memory? In that case vindex cannot be used.
Yes, the result can be (10, 10, 3) in the case 3D → 3D transformation, or (10, 10) for 3D → 2D transformation.
Yes, it does make sense. I also was not considering the masked arrays, so thanks. Probably I will try a flexible solution where the vindex approach or even numpy is used when the data to process are very small compared to the image array, otherwise I will go with your approach.
Not sure if I should open a new thread, anyway investigating about performance when reading from file, I noticed a behavior I did not expected regarding the function where. Suppose one has a zarr array:
Well, I don’t think Dask is clever enough to read only the data it needs with the where function. As I understand it, where is only some conditional applied to return values from an array or another one. I expect it would be very difficult to make Dask aware of the workflow before where call in order to prevent it to read just some parts of an array.