Dask distributed performance issues

I have been using dask on k8s to run tasks. I tested 3 scenarios:

  1. 10,000 tasks distributed over 100 workers (each worker is a pod with 2 CPUs; processes=100, threads=200). Took 450 seconds.
  2. 10,000 tasks distributed over 200 workers (each worker is a pod with 2 CPUs; processes=200, threads=400). Took 960 seconds.
  3. 10,000 tasks distributed over 100 workers (each worker is a pod with 4 CPUs; processes=100, threads=400). Took 1000 seconds.

I would have expected that more workers/threads would get the job done faster, but it was completely the other way around. I tried increasing the scheduler’s and client’s resources (both CPU and memory), but it made no difference.
The task graph is pretty simple: the tasks are independent of each other. Each of the 10,000 tasks either does some ML inference or some computation with NumPy.
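For reference, this kind of embarrassingly parallel submission can be sketched roughly like this (an in-process cluster and a dummy NumPy function stand in for the real k8s setup and ML workload, which are assumptions on my part):

```python
import numpy as np
import distributed

def task(i):
    # stand-in for the real ML inference / NumPy computation
    return float(np.sum(np.arange(i % 100)))

# small in-process cluster for illustration; the real setup runs on k8s
client = distributed.Client(processes=False, n_workers=2, threads_per_worker=2)

# submit many independent tasks and wait for all results
futures = client.map(task, range(1_000))
results = client.gather(futures)

client.close()
```

Because the tasks share no dependencies, the scheduler is free to spread them across all workers; any slowdown when adding workers points at scheduler or I/O overhead rather than the graph structure.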

When scenarios 2 and 3 run, I often see the following behavior on the dashboard:


I was looking at the Dask docs (Scheduling — Dask documentation) but wasn’t quite sure what the best approach is here.

Any advice on this would be great.

I can’t reproduce your issue on a LocalCluster running dask 2022.12.0 on a 32-CPU Linux box.

import time
import distributed
from dask import delayed

# 100 worker processes, 2 threads each, on a single machine
c = distributed.Client(n_workers=100, threads_per_worker=2, memory_limit="8 GiB")

# 10,000 independent tasks, each just sleeping for 4 seconds
dsleep = delayed(time.sleep)
ds = [dsleep(4, dask_key_name=("x", i)) for i in range(10_000)]
%time _ = c.gather(c.compute(ds))

On 100 workers: 3m24s (4s overhead)
On 200 workers: 1m56s (16s overhead)

To debug your problem:

  • If you replace your real workload with plain sleep calls, as above, can you still reproduce the issue?
  • If the problem disappears, are you sure you don’t have an I/O bottleneck?
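To see where the time is actually going per task, distributed’s get_task_stream context manager records, for every task, which worker ran it and how long each action (compute, transfer, deserialize) took. A minimal sketch, using an in-process cluster and short sleeps in place of the real cluster and workload:

```python
import time
import distributed
from distributed import get_task_stream
from dask import delayed

client = distributed.Client(processes=False, n_workers=2, threads_per_worker=2)

dsleep = delayed(time.sleep)
tasks = [dsleep(0.05, dask_key_name=("x", i)) for i in range(20)]

# record per-task timings while the graph runs
with get_task_stream(client) as ts:
    client.gather(client.compute(tasks))

# each record carries the worker, the key, and start/stop times per action
compute_times = []
for rec in ts.data:
    for ss in rec["startstops"]:
        if ss["action"] == "compute":
            compute_times.append(ss["stop"] - ss["start"])

client.close()
```

If the compute times are close to the task durations but the wall-clock run is much longer, the gap is scheduler/transfer overhead; large transfer or deserialize entries would instead point at data movement or I/O.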