Client B submits a compute pipeline on the published dataset, on the same cluster.
Would it be possible to confirm where the compute tasks would be assigned? Specifically, will the workers hosting specific chunks be assigned the compute tasks related to those chunks?
In the common case, distributed runs tasks on workers that already hold the data those tasks depend on. If you have a task f(x) that requires some data x, then that task will very likely run on the worker that already holds x.
Keep in mind that this is what the scheduler tries to achieve. Depending on other concerns (worker availability, for example), a task will not always be assigned to the worker where its data lives.
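To make that trade-off concrete, here is a toy model in plain Python. It is not the scheduler's actual code: the `Worker` class, the `BANDWIDTH` constant and the `pick_worker` function are all hypothetical names invented for this sketch. It assumes the decision can be approximated as "pick the worker that could start the task soonest", where the estimate is the work already queued on a worker plus the time to transfer any dependencies it does not hold.

```python
from dataclasses import dataclass, field

# Assumed network bandwidth in bytes per second (illustrative value only).
BANDWIDTH = 100e6


@dataclass
class Worker:
    """Hypothetical stand-in for a worker as seen by a scheduler."""
    name: str
    occupancy: float = 0.0                    # seconds of work already queued
    has: dict = field(default_factory=dict)   # data key -> size in bytes


def estimated_start(worker, deps):
    """Queued work plus time to fetch the dependencies the worker lacks."""
    missing = sum(nbytes for key, nbytes in deps.items() if key not in worker.has)
    return worker.occupancy + missing / BANDWIDTH


def pick_worker(workers, deps):
    """Choose the worker with the smallest estimated start time."""
    return min(workers, key=lambda w: estimated_start(w, deps))


# The task f(x) depends on x, a 200 MB chunk held by worker "a".
deps = {"x": 200_000_000}
a = Worker("a", occupancy=0.5, has={"x": 200_000_000})
b = Worker("b", occupancy=0.0)

# "a" can start after 0.5 s; "b" would first need 2 s to fetch x.
print(pick_worker([a, b], deps).name)   # a

# If "a" becomes heavily loaded, moving the data to "b" is cheaper.
a.occupancy = 5.0
print(pick_worker([a, b], deps).name)   # b
```

In this model the worker holding x wins as long as it is not too busy, which matches the behaviour described above. On a real cluster, `Client.who_has` reports which workers hold the data behind a set of futures, and the `workers=` keyword of `Client.submit` restricts a task to specific workers if you need stricter placement than the default.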