I’m using dask to process whole-slide imaging (WSI) data and it has been very helpful!
I’ve been working on applying an affine transform to the input WSI. My strategy is to pre-allocate the output (transformed) image as a chunked array and work out the mapping between each output chunk and the corresponding region of the input WSI. Before switching to dask, I was using a zarr array as the input image, and it behaved as expected: with a chunked input WSI, only the requested region was loaded into RAM when iterating over the output image chunks.
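To make the chunk-to-region mapping concrete, here is a minimal sketch of how the input bounding box for one output chunk could be derived. The helper name `input_bbox_for_output_chunk` is hypothetical, and the sketch assumes the `scipy.ndimage.affine_transform` convention (input coordinate = `matrix @ output_coordinate + offset`) for a 2D image; it is not the exact code used here.

```python
import numpy as np


def input_bbox_for_output_chunk(matrix, offset, out_start, out_stop, in_shape):
    """Hypothetical helper: bounds of the input region one output chunk needs.

    Assumes input_coord = matrix @ output_coord + offset (2D only).
    out_start is inclusive, out_stop is exclusive, both in (row, col) order.
    """
    matrix = np.asarray(matrix, dtype=float)
    offset = np.asarray(offset, dtype=float)
    # The four corner pixels of the output chunk.
    corners = np.array(
        [
            [r, c]
            for r in (out_start[0], out_stop[0] - 1)
            for c in (out_start[1], out_stop[1] - 1)
        ],
        dtype=float,
    )
    # Map the corners into input coordinates.
    mapped = corners @ matrix.T + offset
    # Round outward to whole pixels, then clip to the input image bounds.
    lo = np.floor(mapped.min(axis=0)).astype(int)
    hi = np.ceil(mapped.max(axis=0)).astype(int) + 1
    return np.clip(lo, 0, in_shape), np.clip(hi, 0, in_shape)
```

With the returned bounds, only `input[lo[0]:hi[0], lo[1]:hi[1]]` would need to be read for that output chunk; any interpolation margin is left out of this sketch.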
After switching to a dask array as the input WSI, I’m seeing that when iterating over the output chunks, the input dask array seems to be computed (as if compute() were called on it) in each iteration. Here’s the graph of the following snippet; note the “finalize” node and its direct child (the rectangular box):
import dask.array as da
import numpy as np

ref_img = da.from_array(np.eye(2), chunks=1)
out_img = da.empty_like(ref_img)
out_img.map_blocks(
    lambda x, y: np.atleast_2d(y[0, 0]) + 1,
    y=ref_img,
    dtype=ref_img.dtype,
).visualize('da-as-source.png')
If a numpy array is used as the input image instead, here’s the snippet and graph:
out_img.map_blocks(
    lambda x, y: np.atleast_2d(y[0, 0]) + 1,
    y=np.eye(2),
    dtype=ref_img.dtype,
).visualize('npa-as-source.png')
(Sorry, as a new user I can only embed one image.)
link to npa-as-source.png
I was expecting the dask array to behave the same way as the numpy array, i.e. it would be passed to each task without the “finalize” step, each task would just do a getitem on it, and only the relevant data would be touched (in the snippet, the one pixel at the upper-left corner).
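The expected access pattern can be illustrated by building the output by hand, with one `dask.delayed` task per output chunk that receives a lazily sliced piece of the input. This is only a sketch under assumptions: `process_region` and `build_output` are hypothetical names, the slice `src[0:1, 0:1]` stands in for whatever region a real transform would need, and it has not been checked against a real WSI.

```python
import dask
import dask.array as da
import numpy as np

ref_img = da.from_array(np.eye(2), chunks=1)


def process_region(region):
    # `region` arrives as a NumPy array holding only the sliced pixels.
    return np.atleast_2d(region[0, 0]) + 1


def build_output(src, out_chunks):
    """Hypothetical helper: assemble the output from one delayed task per chunk."""
    rows = []
    for height in out_chunks[0]:
        row = []
        for width in out_chunks[1]:
            # Lazy slice of the input; a real transform would compute a
            # different input region for each output chunk.
            region = src[0:1, 0:1]
            task = dask.delayed(process_region)(region)
            row.append(
                da.from_delayed(task, shape=(height, width), dtype=src.dtype)
            )
        rows.append(row)
    # Stitch the per-chunk arrays back into a single 2D dask array.
    return da.block(rows)


out_img = build_output(ref_img, ref_img.chunks)
result = out_img.compute()
```

Because each task is given a slice rather than the whole array, the intent is that its part of the graph only references the input chunks overlapping that slice.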
Is the “finalize” step with a dask array input the expected behavior? If so, how might I change it?
Thanks in advance!