`get_dataset` and task allocation to worker nodes

I have the following usage pattern:

  1. Client A publishes a dataset on a cluster
  2. Client B submits a compute pipeline on the published dataset on the same cluster
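In code, the pattern looks roughly like this (a minimal sketch using the `publish_dataset`/`get_dataset` client API; the scheduler address, array shape, and dataset name are placeholders):

```python
import dask.array as da
from distributed import Client

# Client A: build a chunked array, persist it so the chunks live on
# specific workers, then publish it under a well-known name.
client_a = Client("tcp://scheduler:8786")  # placeholder address
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
x = x.persist()
client_a.publish_dataset(x, name="my-dataset")

# Client B: look up the published dataset and run a pipeline on it.
client_b = Client("tcp://scheduler:8786")
y = client_b.get_dataset("my-dataset")
result = (y + y.T).sum().compute()  # where do these tasks run?
```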

Could you confirm where the compute tasks would be assigned? Specifically, will the workers hosting specific chunks be assigned the compute tasks related to those chunks?



Hi @vij,

As explained in the data locality documentation:
https://distributed.dask.org/en/stable/locality.html

> In the common case distributed runs tasks on workers that already hold dependent data. If you have a task `f(x)` that requires some data `x` then that task will very likely be run on the worker that already holds `x`.

Keep in mind that this is what the scheduler *tries* to achieve: depending on other concerns (worker availability, for example), a task will not always be allocated to the worker where the data lives.
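If you want to verify this on your own cluster, you can compare where the published chunks live with where the pipeline's outputs end up, using `Client.who_has` and `futures_of` (a small sketch; it assumes the dataset was published under the name `"my-dataset"` as in the sketch above, and the scheduler address is a placeholder):

```python
from distributed import Client, futures_of

client = Client("tcp://scheduler:8786")  # placeholder address

# Which workers hold the chunks of the published dataset?
data = client.get_dataset("my-dataset")
for key, workers in client.who_has(futures_of(data)).items():
    print(key, "->", workers)

# Persist a simple pipeline and inspect where its outputs landed.
# With good locality, most output keys sit on the same workers as
# the input chunks they depend on.
result = (data + 1).persist()
for key, workers in client.who_has(futures_of(result)).items():
    print(key, "->", workers)
```

If some outputs land elsewhere, that is the caveat above in action: when the preferred worker is busy or unavailable, the scheduler may run the task on another worker and move the data instead.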