Where do dask clients fetch data from, scheduler or workers?

Arsenal591 · October 16, 2022, 6:11pm

Hi Dask developers,

I have a question about dask in distributed settings: when calling something like result = client.compute(arr), where does the dask client fetch data from? Is it

From workers: the client sets up extra network connections with workers, and after workers finish the computation, they directly send back the result to the client, or
From the scheduler: after workers finish the computation, they send the result to the scheduler, and the scheduler send the result to the client?

The reason I am asking this question is that when deploying a dask cluster with many workers and many clients, I found the network bandwith of my dask scheduler is extremely high, while the bandwith of dask workers is normal. So I guess it might be the second case.

If my guess is right (the scheduler sends data to clients), it seems that the overall performance of dask clusters would be bounded by the network ability of the single-point dask scheduler if there are many clients. Is there a solution of this?

Many thanks.

crusaderky · October 19, 2022, 3:30pm

It depends on the direct_to_workers parameter when you initialise the Client, which you can then override with the direct parameter in Client.gather and Client.scatter
https://distributed.dask.org/en/stable/api.html?highlight=direct#client

The default if omitted is false for a regular client and true for a client running inside a task.

Topic		Replies	Views
Which client is being used? Dask Array	3	246	January 14, 2022
Computing chunks locally before sending to workers with map_blocks Distributed	1	23	July 18, 2024
Get_dataset and task allocation to worker nodes Distributed	1	145	June 26, 2023
Debugging Dask - Futures API Distributed distributed	4	287	May 12, 2022
Client.submit() only running the code on one of the workers Distributed	3	870	March 22, 2022

Where do dask clients fetch data from, scheduler or workers?

Related topics