Subtracting Arrays from Chunks Efficiently

Good question.

In my previous post, the variable background was assumed (perhaps wrongly) to be a NumPy array. But if it is instead read from an image file, then it is important that the background image is aligned block-wise with each of your image files.

This mainly means a few things:

  1. background needs to be a dask array too.
  2. Consequently background must no longer be defined as a keyword parameter to the function subtract(), but rather as a positional parameter:
    def subtract(image, background):
       ...
       return result
    
  3. This also applies to the map_blocks() call:
    images = images.map_blocks(subtract, background)
    
  4. In order for map_blocks() to work, background must first be chunked the same way, along each axis, as each of your images (the chunk boundaries must line up, not just the number of chunks). If background has the same shape as a single image, you can use .rechunk() to make its chunking match:
    background = background.rechunk(chunks=images.chunks[1:])
    
    In this way, the chunks of background are “broadcastable” to the chunks of all your images so that corresponding chunk-regions in background and each image can be subtracted.
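The steps above can be sketched with small in-memory arrays. The body of subtract() here is only a placeholder (plain elementwise subtraction, clipped at zero to avoid wrap-around in unsigned integer types); substitute your own implementation:

```python
import numpy as np
import dask.array as da


# Placeholder subtract(): clip at zero so uint8 values do not wrap around.
def subtract(image, background):
    return np.clip(image.astype(np.int32) - background, 0, None).astype(image.dtype)


# A stack of 10 toy "images", chunked one image per chunk:
images = da.full((10, 4, 6), 100, chunks=(1, 4, 6), dtype='uint8')

# A background with a deliberately different chunking:
background = da.full((4, 6), 30, chunks=(2, 3), dtype='uint8')

# Align the background's chunks with the per-image chunks:
background = background.rechunk(chunks=images.chunks[1:])

# Passed positionally, background is broadcast chunk-wise against each image:
result = images.map_blocks(subtract, background)
print(result.compute().max())  # 70, i.e. 100 - 30
```

Because background has one fewer dimension than images, map_blocks() pairs its single chunk grid with the trailing axes of every image chunk, which is exactly the "broadcastable" behaviour described above.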

I suppose that, like your other images, you may store your background images in any format (Tiff, Zarr, Hdf5) that suits you best, as long as the above conditions are met during processing. Unfortunately, I do not have enough experience with these formats myself to give you any meaningful recommendation for any particular one of them.

import tifffile
import dask.array as da
import dask.array.image  # makes da.image.imread available


def subtract(image, background):
    ...
    return result


def batch_subtract_tiff2hdf5(
    hdf5file, tifffiles, background, dataset_name='2p', chunksize=600, compression=None
):
    """Subtract background from TIFF images and write them to chunked dataset in HDF5 file."""

    # Perhaps the following line works for the tiff format (I do not know)
    # otherwise you can use the method you were using before:
    images = da.image.imread(tifffiles)
    print(images)

    # make sure each background image is chunked the same way as each source image:
    background = background.rechunk(chunks=images.chunks[1:])

    # map_blocks() applies subtract() to corresponding chunks in parallel
    # (e.g. on a compute cluster), so the subtraction should be fast:
    images = images.map_blocks(subtract, background)

    # write processed images to disk:
    images.to_hdf5(
        hdf5file,
        dataset_name,
        # to_hdf5() derives shape and dtype from the array itself,
        # so only chunking and compression need to be specified here:
        chunks=(chunksize, *images.shape[1:]),
        compression=compression,
    )


# dask provides several options to load an array from disk
# see https://docs.dask.org/en/stable/array-api.html#create-and-store-arrays
background = da.image.imread('I:/Scratch/averages/background1.tif')
print(background)

with tifffile.Timer():
    batch_subtract_tiff2hdf5(
        'test.hdf5', 'I:/Scratch/nih3t3-egfp_2/Pos0/*.tif', background, compression='lzf'
    )