Subtracting Arrays from Chunks Efficiently

Good question.

In my previous post, the variable background was assumed (perhaps wrongly) to be a NumPy array. But if it is instead read from an image file, then it is important that the background image is aligned block-wise with each of your image files.

This mainly means a few things:

  1. background needs to be a dask array too.
  2. Consequently background must no longer be defined as a keyword parameter to the function subtract(), but rather as a positional parameter:
    def subtract(image, background):
       ...
       return result
    
  3. This also applies to the map_blocks() call:
    images = images.map_blocks(subtract, background)
    
  4. In order for map_blocks() to work, background must first be chunked the same way, along each axis, as each of your images (the chunk boundaries must line up, not just the number of chunks). If background has the same shape as a single image, you can use .rechunk() to make its chunking match:
    background = background.rechunk(chunks=images.chunks[1:])
    
    In this way, the chunks of background are “broadcastable” to the chunks of all your images so that corresponding chunk-regions in background and each image can be subtracted.
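The steps above can be sketched with small in-memory arrays. The body of subtract() here is only a placeholder (plain elementwise subtraction, clipped at zero to avoid wrap-around in unsigned integer types); substitute your own implementation:

```python
import numpy as np
import dask.array as da


# Placeholder subtract(): clip at zero so uint8 values do not wrap around.
def subtract(image, background):
    return np.clip(image.astype(np.int32) - background, 0, None).astype(image.dtype)


# A stack of 10 toy "images", chunked one image per chunk:
images = da.full((10, 4, 6), 100, chunks=(1, 4, 6), dtype='uint8')

# A background with a deliberately different chunking:
background = da.full((4, 6), 30, chunks=(2, 3), dtype='uint8')

# Align the background's chunks with the per-image chunks:
background = background.rechunk(chunks=images.chunks[1:])

# Passed positionally, background is broadcast chunk-wise against each image:
result = images.map_blocks(subtract, background)
print(result.compute().max())  # 70, i.e. 100 - 30
```

Because background has one fewer dimension than images, map_blocks() pairs its single chunk grid with the trailing axes of every image chunk, which is exactly the "broadcastable" behaviour described above.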

I suppose that, like your other images, you may store your background images in any format (Tiff, Zarr, Hdf5) that suits you best, as long as the above conditions are met during processing. Unfortunately, I do not have enough experience with these formats myself to give you any meaningful recommendation for any particular one of them.

import tifffile
import dask.array as da
import dask.array.image  # makes da.image.imread available


def subtract(image, background):
    ...
    return result


def batch_subtract_tiff2hdf5(
    hdf5file, tifffiles, background, dataset_name='2p', chunksize=600, compression=None
):
    """Subtract background from TIFF images and write them to chunked dataset in HDF5 file."""

    # Perhaps the following line works for the tiff format (I do not know)
    # otherwise you can use the method you were using before:
    images = da.image.imread(tifffiles)
    print(images)

    # make sure each background image is chunked the same way as each source image:
    background = background.rechunk(chunks=images.chunks[1:])

    # map_blocks() applies subtract() to corresponding chunks in parallel
    # (e.g. on a compute cluster), so the subtraction should be fast:
    images = images.map_blocks(subtract, background)

    # write processed images to disk:
    images.to_hdf5(
        hdf5file,
        dataset_name,
        # to_hdf5() derives shape and dtype from the array itself,
        # so only chunking and compression need to be specified here:
        chunks=(chunksize, *images.shape[1:]),
        compression=compression,
    )


# dask provides several options to load an array from disk
# see https://docs.dask.org/en/stable/array-api.html#create-and-store-arrays
background = da.image.imread('I:/Scratch/averages/background1.tif')
print(background)

with tifffile.Timer():
    batch_subtract_tiff2hdf5(
        'test.hdf5', 'I:/Scratch/nih3t3-egfp_2/Pos0/*.tif', background, compression='lzf'
    )