Hello everyone, I have been working on processing large 3D images with dask and dask_image. The default dask_image ndmeasure functions worked for me, though loading data into memory turned out to be a major bottleneck.
I ended up switching to a strategy based on dask.array.reduction. Effectively, I use it to measure several quantities for all objects present in a chunk and then combine these partial results. Computing measures jointly can even speed things up, and most measures reduce well (min/max/histogram/bounding box etc.). The split_every argument even enables spatially logical reduction (e.g. combining chunks in a 2x2x2 manner).
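For concreteness, here is a minimal sketch of what I mean (simplified toy version, not my actual code): each chunk produces a dict of per-label statistics that combine associatively (min, max, sum, count), wrapped in an object-dtype array so dask.array.reduction can tree-merge them with split_every.

```python
import numpy as np
import dask.array as da

def chunk_stats(img, lbl):
    # Per-chunk stats for every label present in this chunk,
    # wrapped in a one-element object array so dask can pass it around.
    stats = {}
    for label in np.unique(lbl):
        if label == 0:  # background
            continue
        vals = img[lbl == label]
        stats[int(label)] = (vals.min(), vals.max(), vals.sum(), vals.size)
    out = np.empty((1,) * img.ndim, dtype=object)
    out[(0,) * img.ndim] = stats
    return out

def merge(a, b):
    # Associative merge of two per-label stats dicts.
    out = dict(a)
    for label, (mn, mx, s, n) in b.items():
        if label in out:
            omn, omx, osum, on = out[label]
            out[label] = (min(omn, mn), max(omx, mx), osum + s, on + n)
        else:
            out[label] = (mn, mx, s, n)
    return out

def _identity(x, axis=None, keepdims=None, **kwargs):
    return x

def _combine(parts, axis=None, keepdims=None, **kwargs):
    merged = {}
    for d in parts.ravel():
        merged = merge(merged, d)
    out = np.empty((1,) * parts.ndim, dtype=object)
    out[(0,) * parts.ndim] = merged
    return out

def _aggregate(parts, axis=None, keepdims=None, **kwargs):
    out = np.empty((), dtype=object)
    out[()] = _combine(parts).ravel()[0]
    return out

# toy data: two half-volume objects
img_np = np.arange(64 ** 3, dtype=float).reshape(64, 64, 64)
lbl_np = np.zeros((64, 64, 64), dtype=np.int32)
lbl_np[:32], lbl_np[32:] = 1, 2
img = da.from_array(img_np, chunks=16)
lbl = da.from_array(lbl_np, chunks=16)

per_chunk = da.map_blocks(chunk_stats, img, lbl, dtype=object)
result = da.reduction(per_chunk, chunk=_identity, combine=_combine,
                      aggregate=_aggregate, dtype=object, split_every=8)
stats = np.asarray(result.compute()).item()  # {label: (min, max, sum, count)}
```

With split_every=8 on a 3D array, dask splits roughly 2x per axis at each reduction level, which is the spatially logical 2x2x2 merge pattern mentioned above.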
Some measures do not play nice with this strategy (like the median). In theory, keeping a list of all intensity values per object would work if memory allows it. In cases where it does not, writing to disk [and (re-)chunking] would have to be the fall-back option.
Now my questions:
- Is there a reason that ndmeasure loads/unloads chunks separately for each object/index?
- For operations that could exceed memory, what would be a good strategy?*
- Memory for tasks that store all intensity values / positions can be structured easily. E.g. if I go over all chunks once, I can extract that chunk 1 contains the first X positions of object A, chunk 2 contains positions X to X+Y, and so on. In a second pass I could then write the values at those offsets. Is there a dask-friendly way to achieve this?
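To illustrate the two-pass idea, here is a minimal NumPy sketch of the offset bookkeeping on a 1D toy "image" (chunks stand in for 3D blocks). In a real pipeline, pass 1 would run via da.map_blocks and pass 2 would write into a preallocated on-disk store such as a zarr array; the names here are my own and purely illustrative.

```python
import numpy as np

def chunk_counts(lbl, labels):
    # Pass 1 helper: how many voxels of each object live in this chunk.
    return np.array([(lbl == L).sum() for L in labels])

# toy data: a 1D "image" split into three chunks
img = np.arange(12.0)
lbl = np.array([1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2])
chunks = [slice(0, 4), slice(4, 8), slice(8, 12)]
labels = [1, 2]

# pass 1: per-chunk counts -> shape (n_chunks, n_labels)
counts = np.stack([chunk_counts(lbl[s], labels) for s in chunks])

# global start of each object's contiguous region in the flat output
totals = counts.sum(axis=0)
obj_start = np.concatenate([[0], np.cumsum(totals)[:-1]])
# where chunk c writes for object j: obj_start[j] + sum of earlier chunks' counts
write_at = obj_start + np.vstack([np.zeros_like(totals),
                                  np.cumsum(counts, axis=0)[:-1]])

# pass 2: write values into preallocated storage (a zarr array on disk in practice)
out = np.empty(totals.sum(), dtype=img.dtype)
for c, s in enumerate(chunks):
    for j, L in enumerate(labels):
        vals = img[s][lbl[s] == L]
        out[write_at[c, j]:write_at[c, j] + len(vals)] = vals

# out now holds all values of object 1, then all of object 2;
# e.g. a per-object median is np.median of the corresponding slice
medians = [np.median(out[obj_start[j]:obj_start[j] + totals[j]])
           for j in range(len(labels))]
```

Since each chunk writes to disjoint regions, pass 2 can run chunk-parallel without coordination, which seems like the dask-friendly property to aim for.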
*The best I could find was this old blog post, but the authors of the chest package state that it is not multi-process safe.