Hello. I’m seeking advice on how to control data locality when reading chunks from a zarr file. That is, I want to ensure that every chunk processed by (distributed) worker i is also loaded by worker i, so that no transfers occur between workers when concatenating after the getitem-* stage.
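For context, here is a minimal, self-contained stand-in for what I actually run (the real data comes from da.from_zarr on a UCX cluster; the shapes and the final reduction are illustrative):

```python
import dask.array as da
from dask.distributed import Client

client = Client(n_workers=2)  # local stand-in for my two-worker UCX cluster

# Stand-in for the real zarr read (da.from_zarr(...)): a 4-d array whose
# chunk grid is (2, 4, 2, 1), i.e. 8 chunks for each value of the first
# chunk index, matching the getitem keys in the log below.
x = da.zeros((2, 400, 200, 100), chunks=(1, 100, 100, 100))
y = x[:, ::2]  # the slicing step is what generates the getitem-* tasks
res = y.sum().compute()
```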
I put in place a scheduler plugin that seems to do the right thing (a trimmed-down sketch of it follows the log below), but looking at the task dashboard I still don’t see each worker loading its 8 chunks:
```
2025-11-26 13:54:48,664 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 1, 1, 1, 0) -> ucx://10.91.54.39:60156
2025-11-26 13:54:48,665 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 1, 1, 0, 0) -> ucx://10.91.54.39:60156
2025-11-26 13:54:48,665 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 1, 3, 1, 0) -> ucx://10.91.54.39:60156
2025-11-26 13:54:48,665 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 1, 3, 0, 0) -> ucx://10.91.54.39:60156
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 1, 0, 0, 0) -> ucx://10.91.54.39:60156
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 1, 2, 1, 0) -> ucx://10.91.54.39:60156
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 1, 2, 0, 0) -> ucx://10.91.54.39:60156
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 1, 0, 1, 0) -> ucx://10.91.54.39:60156
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 0, 0, 0, 0) -> ucx://10.91.54.33:39796
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 0, 3, 0, 0) -> ucx://10.91.54.33:39796
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 0, 2, 0, 0) -> ucx://10.91.54.33:39796
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 0, 3, 1, 0) -> ucx://10.91.54.33:39796
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 0, 2, 1, 0) -> ucx://10.91.54.33:39796
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 0, 1, 0, 0) -> ucx://10.91.54.33:39796
2025-11-26 13:54:48,682 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 0, 1, 1, 0) -> ucx://10.91.54.33:39796
2025-11-26 13:54:48,683 - distributed.scheduler - INFO - [GetitemPlacementPlugin] transition: ('getitem-9f5e00374ed64376acacd763a6019e3f', 0, 0, 1, 0) -> ucx://10.91.54.33:39796
```
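For reference, here is a trimmed-down sketch of the plugin that produced the log above (the real one is longer; setting TaskState.worker_restrictions from the first chunk index is how I currently attempt to enforce placement):

```python
import logging

from distributed import SchedulerPlugin

logger = logging.getLogger("distributed.scheduler")


class GetitemPlacementPlugin(SchedulerPlugin):
    """Pin each getitem task to a worker chosen from its first chunk index."""

    async def start(self, scheduler):
        self.scheduler = scheduler

    def transition(self, key, start, finish, *args, **kwargs):
        # Act when the task becomes schedulable, before a worker is chosen.
        if finish != "waiting":
            return
        # Keys look like ('getitem-<token>', i, j, k, l).
        if not (isinstance(key, tuple) and str(key[0]).startswith("getitem-")):
            return
        workers = sorted(self.scheduler.workers)
        if not workers:
            return
        # Map the first chunk index i to a worker so that all chunks
        # sharing i land on the same worker.
        target = workers[key[1] % len(workers)]
        ts = self.scheduler.tasks.get(key)
        if ts is not None:
            ts.worker_restrictions = {target}
        logger.info("[GetitemPlacementPlugin] transition: %s -> %s", key, target)
```

I register it with client.register_plugin(GetitemPlacementPlugin()) (register_scheduler_plugin on older distributed versions).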
Here I’d like to constrain the first 8 getitem tasks to worker 0 and the following 8 to worker 1 (in my case the mapping is based on the value of the first chunk index of the array being processed).
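In other words, I’m after something with the effect of the sketch below (illustrative only; the addresses are the two workers from the log above, and I’m aware that dask.annotate applies to whole layers created inside the context rather than to individual chunks):

```python
import dask
import dask.array as da

# allow_other_workers=False makes the restriction hard, not a preference.
with dask.annotate(workers=["ucx://10.91.54.33:39796"], allow_other_workers=False):
    part0 = x[0]  # every chunk whose first index is 0
with dask.annotate(workers=["ucx://10.91.54.39:60156"], allow_other_workers=False):
    part1 = x[1]  # every chunk whose first index is 1

res = da.stack([part0, part1]).sum().compute()
```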
Is there a way to impose such hard data-locality constraints?
Thank you.