`get_dataset` and task allocation to worker nodes

I have the following usage pattern:

  1. Client A publishes a dataset on a cluster
  2. Client B submits a compute pipeline on the published dataset on the same cluster
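In code, the pattern looks roughly like this (a minimal sketch using the `publish_dataset`/`get_dataset` client API; the scheduler address, array shape, and dataset name are placeholders):

```python
import dask.array as da
from distributed import Client

# Client A: build a chunked array, persist it so the chunks live on
# specific workers, then publish it under a well-known name.
client_a = Client("tcp://scheduler:8786")  # placeholder address
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
x = x.persist()
client_a.publish_dataset(x, name="my-dataset")

# Client B: look up the published dataset and run a pipeline on it.
client_b = Client("tcp://scheduler:8786")
y = client_b.get_dataset("my-dataset")
result = (y + y.T).sum().compute()  # where do these tasks run?
```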

Could you confirm where the compute tasks would be assigned? Specifically, will the workers hosting specific chunks be assigned the compute tasks related to those chunks?



Hi @vij,

As explained in the data locality documentation:
https://distributed.dask.org/en/stable/locality.html

> In the common case distributed runs tasks on workers that already hold dependent data. If you have a task `f(x)` that requires some data `x` then that task will very likely be run on the worker that already holds `x`.

Keep in mind that this is what the scheduler *tries* to achieve: depending on other concerns (worker availability, for example), a task will not always be allocated to the worker where the data lives.
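If you want to verify this on your own cluster, you can compare where the published chunks live with where the pipeline's outputs end up, using `Client.who_has` and `futures_of` (a small sketch; it assumes the dataset was published under the name `"my-dataset"` as in the sketch above, and the scheduler address is a placeholder):

```python
from distributed import Client, futures_of

client = Client("tcp://scheduler:8786")  # placeholder address

# Which workers hold the chunks of the published dataset?
data = client.get_dataset("my-dataset")
for key, workers in client.who_has(futures_of(data)).items():
    print(key, "->", workers)

# Persist a simple pipeline and inspect where its outputs landed.
# With good locality, most output keys sit on the same workers as
# the input chunks they depend on.
result = (data + 1).persist()
for key, workers in client.who_has(futures_of(result)).items():
    print(key, "->", workers)
```

If some outputs land elsewhere, that is the caveat above in action: when the preferred worker is busy or unavailable, the scheduler may run the task on another worker and move the data instead.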