Hi Dask developers,
I have a question about Dask in distributed settings: when calling something like result = client.compute(arr), where does the Dask client fetch the data from? Is it
- From workers: the client opens extra network connections to the workers, and after the workers finish the computation, they send the result directly back to the client, or
- From the scheduler: after the workers finish the computation, they send the result to the scheduler, and the scheduler sends the result on to the client?
The reason I am asking is that when deploying a Dask cluster with many workers and many clients, I found that the network bandwidth usage of my Dask scheduler is extremely high, while the bandwidth usage of the Dask workers is normal. So I guess it might be the second case.
If my guess is right (the scheduler sends data to clients), it seems that with many clients the overall performance of a Dask cluster would be bounded by the network capacity of the scheduler, which is a single point. Is there a solution to this?
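To make the concern concrete, here is a rough back-of-envelope sketch in plain Python. It does not use Dask at all. The function name scheduler_traffic, the client count, and the result size are made up for illustration and are not measurements from my cluster. The model also ignores control messages, serialization overhead, and compression.

```python
def scheduler_traffic(n_clients, result_bytes, via_scheduler):
    """Estimate result bytes crossing the scheduler's network interface
    when every client fetches one result of `result_bytes` bytes.

    Hypothetical model for illustration only.
    """
    if via_scheduler:
        # Relay case: each result arrives from a worker and is then
        # forwarded to a client, so it is counted once in and once out.
        total = n_clients * result_bytes
        return {"in": total, "out": total}
    # Direct case: results go worker -> client and bypass the scheduler.
    return {"in": 0, "out": 0}


n_clients = 50                # made-up number of clients
result_bytes = 200 * 2**20    # made-up result size: 200 MiB per client

relay = scheduler_traffic(n_clients, result_bytes, via_scheduler=True)
direct = scheduler_traffic(n_clients, result_bytes, via_scheduler=False)

gib = 2**30
print(f"via scheduler: {relay['in'] / gib:.1f} GiB in, {relay['out'] / gib:.1f} GiB out")
print(f"direct:        {direct['in']} bytes in, {direct['out']} bytes out")
```

Under this simplified model, the scheduler's traffic in the relay case grows linearly with the number of clients, while in the direct case it stays flat. That would match what I observe if the second case is what actually happens.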
Many thanks.