Many task transfers during reshaping/rechunking of array

Hi everyone.

I’m usign Dask to scale an ML processing on multispectral images over large areas.
In the first step, there is a dataset preparation:
through ODC (Open Data Cube features are retrieved as a Xarray Dataset, reshaped in a 2D Dask Array, then rows with NaNs in any of feature are discarded and aligned with the ground truth stored in a 1D Dask Array. Finally both array are saved to disk through .to_zarr().

Because of row discarding, the DaskArray will become in an irregular chunk-size and in order to save it as ZARR, a rechunking is required.

During the above operations of rechunking and reshaping, on the Dashboard I see many red highlights and so a really poor efficiency, as in picture.

How Can I avoid, or mitigate, this issue? Could the ‘Work Stealing’ be related?
Furthermore, Why this process tends to accumulate unmanaged memory?

Hi @matteosimone,

Unfortunately, there is probably no easy way to avoid this. Rechunking is by essence an operation that triggers a lot of data exchange between workers (the red bars you see). Every worker must send part of several chunks to a final worker which will hold it. This must be especially true with irregular chunks.
It also tends to accumulate memory because a lot of temporary storage is needed into memory.

Does you algorithm discard a lot of rows? One solution could be to write all the data, even rows with NaNs, and just filter upon reading.
You can also try to use a large chunk size for writing, so that less exchange between random workers occur.