Dask distributed performance issues

I have been using dask on k8s to run tasks. I tested 3 scenarios:

  1. 10,000 tasks distributed over 100 workers (each worker is a pod with 2 CPUs; processes=100, threads=200). Took 450 seconds.
  2. 10,000 tasks distributed over 200 workers (each worker is a pod with 2 CPUs; processes=200, threads=400). Took 960 seconds.
  3. 10,000 tasks distributed over 100 workers (each worker is a pod with 4 CPUs; processes=100, threads=400). Took 1000 seconds.
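
For reference, a cluster like scenario 1 can be started along these lines with the dask-kubernetes operator's KubeCluster; the cluster name, image, and resource values below are placeholders rather than the exact production config:

from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster

# Scenario 1 (sketch): 100 worker pods with 2 CPUs each
# (name, image, and resource values are placeholders)
cluster = KubeCluster(
    name="inference-cluster",
    image="ghcr.io/dask/dask:latest",
    n_workers=100,
    resources={
        "requests": {"cpu": "2", "memory": "4Gi"},
        "limits": {"cpu": "2", "memory": "4Gi"},
    },
)
client = Client(cluster)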

I would have expected that more workers/threads would get the job done faster, but it was completely the other way around. I tried increasing the scheduler’s and client’s resources (both CPU and memory), but it didn’t make any difference.
The task graph is pretty simple in that the tasks are independent of each other. Each of the 10,000 tasks either does some ML inference or some computation with NumPy.
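
To give an idea of the submission pattern, here is a minimal sketch; run_one and the scheduler address are placeholders, not the real workload:

import numpy as np
from dask.distributed import Client

def run_one(i):
    # Placeholder for the real work (ML inference or a NumPy computation)
    rng = np.random.default_rng(i)
    x = rng.normal(size=(500, 500))
    return float(np.linalg.norm(x @ x.T))

client = Client("tcp://dask-scheduler:8786")  # placeholder scheduler address
futures = client.map(run_one, range(10_000))  # 10,000 independent tasks
results = client.gather(futures)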

When scenarios 2 and 3 run, I often see the following behavior on the dashboard:

[dashboard screenshot]

I was looking in the Dask docs (Scheduling — Dask documentation) but wasn’t quite sure what the best approach is here.

Any advice on this would be great.

Hi,
I can’t reproduce your issue on a LocalCluster running dask 2022.12.0 on a 32-CPU Linux box.

import time
import distributed
from dask import delayed

# 100 workers x 2 threads each, mimicking scenario 1 on a single machine
c = distributed.Client(n_workers=100, threads_per_worker=2, memory_limit="8 GiB")

# 10,000 independent tasks that each just sleep for 4 seconds
dsleep = delayed(time.sleep)
ds = [dsleep(4, dask_key_name=("x", i)) for i in range(10_000)]

%time _ = c.gather(c.compute(ds))

On 100 workers: 3m24s (4s overhead)
On 200 workers: 1m56s (16s overhead)

To debug your problem:

  • If you replace your real workload with just sleep calls, like above, can you still reproduce the issue?
  • If the problem disappears, are you sure you don’t have a bottleneck on I/O? A performance report, as sketched below, can help narrow that down.
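
A minimal way to capture where the time goes, assuming the same client c as above (the filename is arbitrary; wrap your real computation instead of the sleeps):

from dask.distributed import performance_report

# Writes an HTML report with the task stream, worker profiles, and transfer info
with performance_report(filename="dask-report.html"):
    _ = c.gather(c.compute(ds))

The report makes it fairly easy to see whether workers are sitting idle waiting on the scheduler or on data, versus actually computing.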