I’m using Dask to scale ML processing of multispectral images over large areas.
The first step is dataset preparation:
features are retrieved through ODC (Open Data Cube, https://www.opendatacube.org/) as an xarray Dataset and reshaped into a 2D Dask Array; rows containing NaN in any feature are then discarded, keeping them aligned with the ground truth, which is stored in a 1D Dask Array. Finally, both arrays are saved to disk with .to_zarr().
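For reference, here is a minimal sketch of this step (names like `dc`, the product/measurement names, and `gt` for the ground-truth array are placeholders, not my exact code):

```python
import dask.array as da

# `dc` is a datacube.Datacube() instance; product/measurement names are placeholders
ds = dc.load(
    product="my_product",
    measurements=["band1", "band2", "band3"],
    dask_chunks={"time": 1, "y": 2048, "x": 2048},
)

# Stack the feature bands into a 2D (sample, feature) Dask Array
X = (
    ds.to_array(dim="feature")
    .stack(sample=("time", "y", "x"))
    .transpose("sample", "feature")
    .data
)

# Drop rows with NaN in any feature, keeping the labels (`gt`, 1D) aligned
mask = ~da.isnan(X).any(axis=1)
X_clean = X[mask]   # chunk sizes along axis 0 are now unknown (nan)
y_clean = gt[mask]
```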
Because of the row discarding, the Dask Array ends up with irregular (unknown) chunk sizes, so rechunking is required before it can be saved to Zarr.
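Roughly, the rechunking step looks like this (the chunk size of 100_000 rows is just an example value):

```python
# Boolean masking leaves unknown chunk sizes, which to_zarr() rejects;
# compute them, then rechunk to a regular grid before writing.
X_clean.compute_chunk_sizes()   # triggers an immediate pass over the data
y_clean.compute_chunk_sizes()

X_out = X_clean.rechunk((100_000, -1))   # example row-chunk size; -1 = whole feature axis
y_out = y_clean.rechunk(100_000)

da.to_zarr(X_out, "features.zarr", overwrite=True)
da.to_zarr(y_out, "labels.zarr", overwrite=True)
```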
During these rechunking and reshaping operations, the Dashboard shows many red bars (which I understand indicate inter-worker data transfers), i.e. really poor efficiency, as in the picture below.
How can I avoid, or at least mitigate, this issue? Could work stealing be related?
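If it helps to narrow things down, would disabling work stealing for a test run be a reasonable experiment? Something like:

```python
import dask
from dask.distributed import Client

# Disable work stealing for a test run (must be set before the cluster starts)
dask.config.set({"distributed.scheduler.work-stealing": False})
client = Client()
```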
Furthermore, why does this process tend to accumulate unmanaged memory?