Determining how to efficiently store & process 4D Volumes in Zarr for Napari Visualization

I have several 3D NumPy arrays of shape (1100, 1000, 2000), each taking up approximately 3 GB. I wanted an efficient way to store this series of 10 3D volumes, and I came across Zarr, which seems like a powerful tool. I wasn't sure how to chunk the volumes, so I simply used a chunk shape of (1, 55, 50, 100), which stores one time point per chunk and divides each spatial dimension by 20.

import zarr

# output_store, dtype and compressor are defined earlier in my script
store = zarr.DirectoryStore(output_store)
zarr_array = zarr.create(shape=(10, 1100, 1000, 2000),
                         chunks=(1, 55, 50, 100),
                         dtype=dtype,
                         compressor=compressor,
                         store=store,
                         overwrite=True)

My goal is now to display these 3D volumes as a time series in napari. But before I do that, I need to threshold each volume, so I wrote the following script.

import napari
import zarr
import dask.array as da
import numpy as np

def adjust_threshold(data: da.Array, threshold: int) -> da.Array:
    # Clip to [0, threshold], rescale to [0, 255] and convert to uint8
    image = da.clip(data, 0, threshold)
    image = (image / threshold) * 255
    image = image.astype(np.uint8)
    return image

# zarr_path and threshold_level are defined earlier in my script
data = zarr.open(zarr_path, mode='r')
data = da.from_zarr(data)

# Process each chunk independently (lazily, until compute() is called)
processed_data = adjust_threshold(data, threshold_level)

result = processed_data.compute()

However, it takes a very long time for my computer to process all this, and I'm now wondering whether I'm actually leveraging parallelism for visualizing in napari. In the Dask dashboard all I see is the astype task, and none of the other operations. Ideally, each 3D volume should be processed concurrently by Dask, but I can't tell whether that is happening in my case.
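For reference, the chunk layout Dask ends up with can be inspected like this (just a quick sketch; zarr_path is the same path as above):

import dask.array as da
import numpy as np

data = da.from_zarr(zarr_path)
print(data.chunks)                    # chunk sizes along each dimension
print(int(np.prod(data.numblocks)))   # total number of chunks, i.e. tasks per elementwise operation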

Hi @Johnoker, welcome to the Dask community!

I think this is the source of your performance problem: with this chunking you create 80,000 chunks of about 1 MiB of uncompressed data each. That's a lot of small files on disk, which incurs a lot of small IOs when processing the dataset.
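For reference, the back-of-the-envelope arithmetic (assuming 4-byte voxels here, adjust for your actual dtype):

import numpy as np

shape = (10, 1100, 1000, 2000)
chunks = (1, 55, 50, 100)

n_chunks = np.prod([s // c for s, c in zip(shape, chunks)])
chunk_mib = np.prod(chunks) * 4 / 2**20   # 4 bytes per voxel assumed

print(n_chunks)   # 80000 chunks
print(chunk_mib)  # ~1.05 MiB uncompressed per chunk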

Ideally, chunks should be at least 10 times bigger, or even 100 times.

Also, depending on how you want to visualize the data, you might want to chunk differently, for example by keeping a whole time series in the same chunk, i.e. keeping the entire length of the time dimension in each chunk.
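As a sketch, reusing the names from your snippet (the spatial chunk sizes below are only illustrative):

import zarr

store = zarr.DirectoryStore(output_store)
zarr_array = zarr.create(shape=(10, 1100, 1000, 2000),
                         chunks=(10, 110, 100, 200),   # whole time dimension in each chunk
                         dtype=dtype,
                         compressor=compressor,
                         store=store,
                         overwrite=True)

This gives 1,000 chunks of about 22 million voxels each, roughly 80 times bigger than before.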

That's perfectly normal: all your operations are fused into a single task per chunk because they are embarrassingly parallel, and that task takes the name of the last operation.

From the look of your dashboard, this is the case. However, each task is so small that you lose a lot of time scheduling tasks instead of actually computing. Increase the chunk size!
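If rewriting the store isn't an option right away, you can at least ask Dask for bigger in-memory chunks when reading the existing store (a sketch, reusing zarr_path, adjust_threshold and threshold_level from your script; the chunk sizes are illustrative, but should be multiples of the on-disk chunks):

import dask.array as da

data = da.from_zarr(zarr_path, chunks=(1, 275, 250, 500))  # multiples of (1, 55, 50, 100)
processed = adjust_threshold(data, threshold_level)
result = processed.compute()

That said, rewriting the Zarr store with bigger chunks is the real fix, since otherwise you still have all those small files on disk.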
