Hi team,
I use Dask as the distributed infrastructure to train AI models. I have a function that computes a result on each worker, and I use the Dask client to trigger the calculation as follows:
from dask.distributed import get_client

# Connect to the existing scheduler and wait until all workers register.
cli = get_client(address=f"tcp://{my_host}:{my_port}")
cli.wait_for_workers(n_workers=my_worker_num)

# Run the function once on every worker and wait for all results.
my_result = cli.run(my_calculation_func, **kwargs, wait=True)
I am sure that my compute function (my_calculation_func in the snippet above) takes less than 1 second, yet cli.run takes more than 10 seconds to finish when I use 30 workers. Moreover, the duration of cli.run grows with the number of workers: with 120 workers it takes more than 50 seconds, which is unacceptable for my use case.
My Dask version is 2022.2.0. I have also tested cli.submit(my_calculation_func, pure=False), and it performs worse than cli.run. Each worker runs on a separate server, and very little data is transferred between the workers and the client, so network bandwidth should not be the bottleneck.
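For anyone who wants to reproduce the measurement, a minimal timing helper like the one below can separate the function's own runtime from the cli.run fan-out overhead. This is just a sketch: timed is a hypothetical helper, and the commented usage assumes the cli and my_calculation_func names from my snippet above.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage against the Dask client from my snippet:
#   _, direct = timed(my_calculation_func)           # the function alone, ~1 s
#   _, fanout = timed(cli.run, my_calculation_func)  # full fan-out, >10 s with 30 workers
#   overhead = fanout - direct                       # scheduler/RPC cost I am asking about
```

The difference between the two measurements is the per-cluster overhead that grows with the worker count.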
My questions: is this time cost of the run function reasonable? Is there any way to optimize it and reduce the duration?
Thanks in advance.
Chris Ding