I’m using Dask to scale ML processing of multispectral images over large areas.
The first step is dataset preparation:
features are retrieved through ODC (Open Data Cube, https://www.opendatacube.org/) as an xarray Dataset and reshaped into a 2D Dask Array; rows containing NaN in any feature are then discarded, keeping them aligned with the ground truth, which is stored in a 1D Dask Array. Finally, both arrays are saved to disk with .to_zarr().
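For reference, here is a minimal sketch of this step (names like `dc`, the product/measurement names, and `gt` for the ground-truth array are placeholders, not my exact code):

```python
import dask.array as da

# `dc` is a datacube.Datacube() instance; product/measurement names are placeholders
ds = dc.load(
    product="my_product",
    measurements=["band1", "band2", "band3"],
    dask_chunks={"time": 1, "y": 2048, "x": 2048},
)

# Stack the feature bands into a 2D (sample, feature) Dask Array
X = (
    ds.to_array(dim="feature")
    .stack(sample=("time", "y", "x"))
    .transpose("sample", "feature")
    .data
)

# Drop rows with NaN in any feature, keeping the labels (`gt`, 1D) aligned
mask = ~da.isnan(X).any(axis=1)
X_clean = X[mask]   # chunk sizes along axis 0 are now unknown (nan)
y_clean = gt[mask]
```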
Because of the row discarding, the Dask Array ends up with irregular (unknown) chunk sizes, so rechunking is required before it can be saved to Zarr.
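Roughly, the rechunking step looks like this (the chunk size of 100_000 rows is just an example value):

```python
# Boolean masking leaves unknown chunk sizes, which to_zarr() rejects;
# compute them, then rechunk to a regular grid before writing.
X_clean.compute_chunk_sizes()   # triggers an immediate pass over the data
y_clean.compute_chunk_sizes()

X_out = X_clean.rechunk((100_000, -1))   # example row-chunk size; -1 = whole feature axis
y_out = y_clean.rechunk(100_000)

da.to_zarr(X_out, "features.zarr", overwrite=True)
da.to_zarr(y_out, "labels.zarr", overwrite=True)
```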
During these rechunking and reshaping operations, the Dashboard shows many red bars (which I understand indicate inter-worker data transfers), i.e. really poor efficiency, as in the picture below.
How can I avoid, or at least mitigate, this issue? Could work stealing be related?
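If it helps to narrow things down, would disabling work stealing for a test run be a reasonable experiment? Something like:

```python
import dask
from dask.distributed import Client

# Disable work stealing for a test run (must be set before the cluster starts)
dask.config.set({"distributed.scheduler.work-stealing": False})
client = Client()
```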
Furthermore, why does this process tend to accumulate unmanaged memory?