Subtracting Arrays from Chunks Efficiently

Hello everyone!

I’m currently trying to perform subtractions on Dask arrays that I build using Tifffile before writing them all out to Zarr or HDF5. Here’s the code I’m using to accomplish the first step, and here’s the problem I’m trying to solve.

I’m sure I could do the subtraction for each image one by one as it’s read into the array, but I think it’d be better to wait until a chunk is fully built and then perform the subtraction on the entire chunk in one vectorized operation. I figured Dask would be the way to do that nice and fast!

Any advice for how to do this properly?

Hi @jmdelahanty

Interesting task!

Here’s a suggestion. If it doesn’t suit your purpose feel free to counter.

Note that the code snippets here have not been tested at all! They are only meant to serve as a guide (or outline). After all, I’m not sure what subtraction means in your case nor what exactly you intend to subtract from each image.

So I’d first define a function:

def subtract(image, background=None):
    ...
    return result

where image is a numpy array representing an image (or a chunk of an image) and background is a numpy array which you want to subtract from image. The returned result must have the same shape as image.

This function subtract() will later serve as the chunk-function for dask.array.map_blocks().
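For instance, here is a minimal sketch of such a function, assuming unsigned-integer images where naive subtraction would wrap around; the clipping logic is just an example, so adapt it to whatever your subtraction actually means:

```python
import numpy as np


def subtract(image, background=None):
    # Hypothetical example: promote to a signed type and clip at zero so
    # that unsigned pixels cannot wrap around below zero.
    if background is None:
        return image
    result = np.clip(image.astype(np.int32) - background.astype(np.int32), 0, None)
    # cast back to the original dtype so the output matches the input
    return result.astype(image.dtype)
```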

import tifffile
import dask.array as da


def batch_subtract_tiff2hdf5(
    hdf5file, tifffiles, background, dataset_name='2p', chunksize=600, compression=None
):
    """Subtract background from TIFF images and write them to chunked dataset in HDF5 file."""

    # Perhaps the following line works for the tiff format (I do not know);
    # otherwise you can use the method you were using before:
    images = da.image.imread(tifffiles)
    print(images)

    # The following line takes advantage of parallelism in a compute cluster
    # to do the subtraction, and so it should be fast:
    images = images.map_blocks(subtract, background=background)

    # write processed images to disk
    # (dask's to_hdf5 infers shape and dtype from the array itself):
    images.to_hdf5(
        hdf5file,
        dataset_name,
        chunks=(chunksize, *images.shape[1:]),
        compression=compression,
    )

background = ...  # whatever you want to subtract
with tifffile.Timer():
    batch_subtract_tiff2hdf5(
        'test.hdf5', 'I:/Scratch/nih3t3-egfp_2/Pos0/*.tif', background, compression='lzf'
    )

Thank you so much for your reply! This makes sense. I’ll give this a shot tomorrow and report back how it goes.

One quick question I have is about how I should store my background image.

In my case, I just have an average image of an artifact to subtract. It’s just a tiff file as well. But I’m thinking that maybe I can just make a library of numpy arrays written to an H5/zarr or something that contain these average tiffs. That way I can flexibly grab different averages to subtract depending on the settings used in recordings.

How do you think I should store these average images? Is just storing many different average images in a folder a better way to do it you think?

Good question.

In my previous post the variable background was assumed (perhaps wrongly) to be a numpy array. But if it is instead an image file, then it is important that the background image is aligned block-wise with each of your image files.

This mainly means a few things:

  1. background needs to be a dask array too.
  2. Consequently background must no longer be defined as a keyword parameter to the function subtract(), but rather as a positional parameter:
    def subtract(image, background):
        ...
        return result
    
  3. This also applies to the map_blocks() call:
    images = images.map_blocks(subtract, background)
    
  4. In order for map_blocks() to work, background must first have, along each axis, the same number of chunks as each of your images. If background has the same shape as any image, you may use .rechunk() to ensure that the number of chunks match:
    background = background.rechunk(chunks=images.chunks[1:])
    
    In this way, the chunks of background are “broadcastable” to the chunks of all your images so that corresponding chunk-regions in background and each image can be subtracted.
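As a toy illustration of this chunk-wise broadcasting (the array sizes and fill values here are made up), a 2-D background whose chunks match the last two axes of the image stack can be passed straight to map_blocks():

```python
import numpy as np
import dask.array as da

# a stack of 4 "images", chunked along the first (image-count) axis
images = da.full((4, 6, 6), 5.0, chunks=(2, 3, 3))
# a single 2-D background, chunked the same way as each image plane
background = da.full((6, 6), 2.0, chunks=(3, 3))


def subtract(image, background):
    # each call receives one numpy chunk of images and the matching
    # numpy chunk of background; numpy broadcasting does the rest
    return image - background


result = images.map_blocks(subtract, background).compute()
```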

I suppose that, like your other images, you may store your background images in any format (Tiff, Zarr, Hdf5) that suits you best, as long as the above conditions are met during processing. Unfortunately, I do not have enough experience with these formats myself to give you any meaningful recommendation for any particular one of them.

import tifffile
import dask.array as da


def subtract(image, background):
    ...
    return result


def batch_subtract_tiff2hdf5(
    hdf5file, tifffiles, background, dataset_name='2p', chunksize=600, compression=None
):
    """Subtract background from TIFF images and write them to chunked dataset in HDF5 file."""

    # Perhaps the following line works for the tiff format (I do not know)
    # otherwise you can use the method you were using before:
    images = da.image.imread(tifffiles)
    print(images)

    # make sure each background image is chunked the same way as each source image:
    background = background.rechunk(chunks=images.chunks[1:])

    # The following line takes advantage of parallelism in a compute cluster
    # to do the subtraction, and so it should be fast:
    images = images.map_blocks(subtract, background)

    # write processed images to disk:
    images.to_hdf5(
        hdf5file,
        dataset_name,
        chunks=(chunksize, *images.shape[1:]),
        compression=compression,
    )


# dask provides several options to load an array from disk
# see https://docs.dask.org/en/stable/array-api.html#create-and-store-arrays
background = da.image.imread('I:/Scratch/averages/background1.tif')
print(background)

with tifffile.Timer():
    batch_subtract_tiff2hdf5(
        'test.hdf5', 'I:/Scratch/nih3t3-egfp_2/Pos0/*.tif', background, compression='lzf'
    )

I can definitely make it a numpy array! I’m pretty sure the intermediate format for these chunks is numpy arrays under the hood, right?
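For what it’s worth, a quick way to check this on a toy array, just to see what map_blocks actually hands to the chunk-function:

```python
import numpy as np
import dask.array as da

x = da.ones((4, 4), chunks=(2, 2))


def check(block):
    # each block that map_blocks passes in is a plain numpy.ndarray
    assert isinstance(block, np.ndarray)
    return block


checked = x.map_blocks(check).compute()
```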

Thank you also for this update! I’ll try it out and see how it goes. Will update here when I try it out.


Quick update here: we’re still evaluating whether doing it this way is the better option, or whether we should generate an FFT mask from an average of the signal, apply a Gaussian filter to it, and then use that as the mask. During preprocessing we would then perform FFTs on our images, apply the mask on a per-chunk basis, and finally get everything into H5 or Zarr at the end for later processing.

What do you think about that @ParticularMiner ?

Looks to me like a pretty solid strategy @jmdelahanty!

Looking through the dask docs on FFTs, I see that they require each image to be “un-chunked”. But I don’t think this should be much of a problem, since nowadays an entire image can usually fit easily in memory. This just means that you should avoid chunking any image itself; that is, the last 2 axes of your images-array should not be chunked. This still remains an “embarrassingly parallel” task (if you have lots of images to filter at once), and dask will be able to handle it pretty well, in my opinion.
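As a rough sketch of that idea (sizes are made up, and the all-ones mask is just a stand-in for your Gaussian-smoothed FFT mask): keep each image in a single chunk, chunk only along the first axis, transform, multiply, and transform back.

```python
import numpy as np
import dask.array as da

# 8 small "images"; only the first (image-count) axis is chunked, so the
# last two axes satisfy dask's requirement for FFTs
images = da.random.random((8, 16, 16), chunks=(4, 16, 16))

# placeholder for your precomputed, gaussian-smoothed FFT mask
mask = np.ones((16, 16))

spectra = da.fft.fft2(images)           # FFT over the last two axes
filtered = da.fft.ifft2(spectra * mask).real
result = filtered.compute()
```

With a real mask you would load it from disk once and reuse it across all images; here the identity mask just makes the round trip verifiable.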


Martin Durant told me the same thing! We have 47k images per recording, and soon we may have something close to 100k images to get through per recording! Each image is really small, about 0.5 MB, so doing it to each image should be pretty simple, I’m guessing. I’m working on that other post you commented on at the moment, so once that’s done I’ll get back into doing this again! Hopefully over the next few days.