I have a series of ten 3D numpy arrays, each of shape (1100, 1000, 2000) and each taking up approximately 3 GB. I wanted an efficient way to store this series of volumes and came across zarr, which seems to be a powerful tool. I was not sure how to chunk the volumes, so I simply used a chunk shape of (1, 55, 50, 100), i.e. one volume per chunk along the time axis, with each spatial dimension divided by 20.
store = zarr.DirectoryStore(output_store)
zarr_array = zarr.create(shape=(10, 1100, 1000, 2000),  # time axis first
                         chunks=(1, 55, 50, 100),       # one volume per chunk along time
                         dtype=dtype,
                         compressor=compressor,
                         store=store,
                         overwrite=True)
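For completeness, I populate the array one volume at a time, along these lines (a sketch: load_volume and volume_files stand in for my actual loading code):

for i, path in enumerate(volume_files):
    volume = load_volume(path)  # hypothetical loader returning a (1100, 1000, 2000) numpy array
    zarr_array[i] = volume      # write the whole volume into its slot along the time axis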
My goal is now to display these 3D volumes as a time series in napari. Before I do that, though, I need to threshold each volume, so I wrote the following script.
import napari
import zarr
import dask.array as da
import numpy as np

def adjust_threshold(data: da.Array, threshold: int) -> da.Array:
    # Clip to [0, threshold], rescale to [0, 255], and cast to uint8
    image = da.clip(data, 0, threshold)
    image = (image / threshold) * 255
    image = image.astype(np.uint8)
    return image

data = zarr.open(zarr_path, mode='r')
data = da.from_zarr(data)

# Process each chunk independently
processed_data = adjust_threshold(data, threshold_level)
result = processed_data.compute()
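After computing, I hand the result to napari along these lines (a sketch of what I intend; I have not tuned any viewer settings yet):

viewer = napari.view_image(result)  # leading (time) axis becomes a slider in napari
napari.run()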
However, it takes a very long time for my computer to process all of this, and I am now wondering whether I am leveraging the full parallelism available for visualizing in napari. In the dask dashboard, the only task name I see is astype; none of the other operations show up. Ideally, each 3D volume would be processed concurrently by dask, but I cannot tell whether that is actually happening in my case.
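For reference, this is roughly how I start the dashboard and tried to inspect what dask plans to do (a sketch; visualize() requires graphviz to be installed):

from dask.distributed import Client

client = Client()                # local cluster; dashboard served at client.dashboard_link
print(data.chunks)               # chunk layout inherited from the zarr store
print(processed_data.numblocks)  # blocks per axis, i.e. how many tasks could run in parallel
processed_data.visualize(filename='graph.svg')  # render the (possibly fused) task graph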