I have been using dask on k8s to run tasks. I tested 3 scenarios:
- 10,000 tasks distributed over 100 workers (each worker is a pod with 2 CPUs; cluster totals: processes=100, threads=200). Took 450 seconds.
- 10,000 tasks distributed over 200 workers (each worker is a pod with 2 CPUs; cluster totals: processes=200, threads=400). Took 960 seconds.
- 10,000 tasks distributed over 100 workers (each worker is a pod with 4 CPUs; cluster totals: processes=100, threads=400). Took 1000 seconds.
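For context, the processes/threads numbers above are the cluster-wide totals that the client reports. A minimal sketch of how I read them (the scheduler address is a placeholder for my actual scheduler service):

```python
from dask.distributed import Client

# Connect to the scheduler running in the cluster
# (address is a placeholder, not my real service name).
client = Client("tcp://dask-scheduler:8786")

# The repr shows the cluster-wide totals quoted above,
# e.g. processes=100 threads=200.
print(client)

# Per-worker detail: thread count advertised by each worker pod.
for addr, worker in client.scheduler_info()["workers"].items():
    print(addr, worker["nthreads"])
```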
I would have expected more workers/threads to get the job done faster, but it was completely the other way around. I tried increasing the scheduler's and client's resources (both CPU and memory), but that didn't make any difference.
The task graph is very simple: the tasks are completely independent of each other. Each of the 10,000 tasks either does some ML inference or some computations with NumPy. A simplified version of the submission pattern is sketched below.
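The submission is essentially a flat client.map over independent inputs; here a dummy NumPy payload stands in for the real inference/compute functions, and the scheduler address is again a placeholder:

```python
import numpy as np
from dask.distributed import Client

client = Client("tcp://dask-scheduler:8786")  # placeholder address

def run_one(i):
    # Stand-in for the real work: each task is either ML inference
    # or a NumPy computation, with no dependencies on other tasks.
    x = np.random.default_rng(i).random((500, 500))
    return float(np.linalg.norm(x @ x.T))

# 10,000 independent tasks fanned out to the workers.
futures = client.map(run_one, range(10_000))
results = client.gather(futures)
```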
When scenarios 2 and 3 run, I often see the following behavior on the dashboard:
I was looking at the Dask docs (Scheduling — Dask documentation) but wasn't quite sure what the best approach is here.
Any advice on this would be great.