Determining how to efficiently store & process 4D Volumes in Zarr for Napari Visualization

I have several 3D NumPy arrays of shape (1100, 1000, 2000), each taking up approximately 3 GB. I wanted an efficient way to store this series of 10 3D volumes, and I came across Zarr, which seems like a powerful tool. I wasn't sure how to chunk the volumes, so I simply used a chunk shape of (1, 55, 50, 100), which stores one time point per chunk and divides each spatial dimension by 20.

import zarr

# output_store, dtype and compressor are defined earlier in my script
store = zarr.DirectoryStore(output_store)
zarr_array = zarr.create(shape=(10, 1100, 1000, 2000),
                         chunks=(1, 55, 50, 100),
                         dtype=dtype,
                         compressor=compressor,
                         store=store,
                         overwrite=True)

My goal is now to display these 3D volumes as a time series in napari. But before I do that, I need to threshold each volume, so I wrote the following script.

import napari
import zarr
import dask.array as da
import numpy as np

def adjust_threshold(data: da.Array, threshold: int) -> da.Array:
    # Clip to [0, threshold], rescale to [0, 255] and convert to uint8
    image = da.clip(data, 0, threshold)
    image = (image / threshold) * 255
    image = image.astype(np.uint8)
    return image

# zarr_path and threshold_level are defined earlier in my script
data = zarr.open(zarr_path, mode='r')
data = da.from_zarr(data)

# Process each chunk independently (lazily, until compute() is called)
processed_data = adjust_threshold(data, threshold_level)

result = processed_data.compute()

However, it takes a very long time for my computer to process all this, and I'm now wondering whether I'm actually leveraging parallelism for visualizing in napari. In the Dask dashboard all I see is the astype task, and none of the other operations. Ideally, each 3D volume should be processed concurrently by Dask, but I can't tell whether that is happening in my case.
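For reference, the chunk layout Dask ends up with can be inspected like this (just a quick sketch; zarr_path is the same path as above):

import dask.array as da
import numpy as np

data = da.from_zarr(zarr_path)
print(data.chunks)                    # chunk sizes along each dimension
print(int(np.prod(data.numblocks)))   # total number of chunks, i.e. tasks per elementwise operation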

Hi @Johnoker, welcome to the Dask community!

I think this is the source of your performance problem: with this chunking you create 80,000 chunks of about 1 MiB of uncompressed data each. That's a lot of small files on disk, which incurs a lot of small IOs when processing the dataset.
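For reference, the back-of-the-envelope arithmetic (assuming 4-byte voxels here, adjust for your actual dtype):

import numpy as np

shape = (10, 1100, 1000, 2000)
chunks = (1, 55, 50, 100)

n_chunks = np.prod([s // c for s, c in zip(shape, chunks)])
chunk_mib = np.prod(chunks) * 4 / 2**20   # 4 bytes per voxel assumed

print(n_chunks)   # 80000 chunks
print(chunk_mib)  # ~1.05 MiB uncompressed per chunk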

Ideally, chunks should be at least 10 times bigger, or even 100 times.

Also, depending on how you want to visualize the data, you might want to chunk differently, for example by keeping a whole time series in the same chunk, i.e. keeping the entire length of the time dimension in each chunk.
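As a sketch, reusing the names from your snippet (the spatial chunk sizes below are only illustrative):

import zarr

store = zarr.DirectoryStore(output_store)
zarr_array = zarr.create(shape=(10, 1100, 1000, 2000),
                         chunks=(10, 110, 100, 200),   # whole time dimension in each chunk
                         dtype=dtype,
                         compressor=compressor,
                         store=store,
                         overwrite=True)

This gives 1,000 chunks of about 22 million voxels each, roughly 80 times bigger than before.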

That's perfectly normal: all your operations are fused into a single task per chunk because they are embarrassingly parallel, and that task takes the name of the last operation.

From the look of your dashboard, this is the case. However, each task is so small that you lose a lot of time scheduling tasks instead of actually computing. Increase the chunk size!
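If rewriting the store isn't an option right away, you can at least ask Dask for bigger in-memory chunks when reading the existing store (a sketch, reusing zarr_path, adjust_threshold and threshold_level from your script; the chunk sizes are illustrative, but should be multiples of the on-disk chunks):

import dask.array as da

data = da.from_zarr(zarr_path, chunks=(1, 275, 250, 500))  # multiples of (1, 55, 50, 100)
processed = adjust_threshold(data, threshold_level)
result = processed.compute()

That said, rewriting the Zarr store with bigger chunks is the real fix, since otherwise you still have all those small files on disk.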
