New workers are idle while tasks are pending

In our workflow, an HTTP server receives a request containing a payload for processing, starts a Dask cluster, splits the payload into batches, and then submits each batch to the cluster using fire_and_forget.
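Roughly, the submission path looks like this (the scheduler address, batch size and `process_batch` are placeholders, not our real names):

```python
# Rough sketch of the submission path; names and values are placeholders.
from dask.distributed import Client, fire_and_forget

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

def process_batch(batch):
    ...  # actual processing logic lives here

def handle_request(payload, batch_size=100):
    # split the payload into batches and submit each one independently
    batches = [payload[i:i + batch_size] for i in range(0, len(payload), batch_size)]
    for batch in batches:
        future = client.submit(process_batch, batch)
        fire_and_forget(future)  # let the task run without holding on to the future
```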

The workers run on EC2 (via Kubernetes), so naturally they don’t start up instantly: an instance needs to be provisioned, the Docker image has to be downloaded, and so on.

I just tried a small run with 7 batches and spun up a 7-worker cluster to run them. Intriguingly, even though all 7 machines were up and ready, only 3 of them were actually processing: 3 of the tasks completed in parallel, then the next 3. The logs of the other 4 workers showed nothing happening at all.

I figure the scheduler must somehow be assigning all of these tasks to whichever machine happens to be available at the time they are submitted - but this doesn’t strike me as the right behaviour at all. Surely it’s only when the task is about to start that a worker node should be selected. So if only 3 nodes are up, the task has to wait to begin, but if an idle node is available, execution should start immediately on that node.

Is there any way to get this behaviour?

Hi @NakulK48,

Your guess is right: the scheduler assigns tasks to workers at the time they are submitted, and this is especially noticeable with a small number of tasks.

But to achieve what you want, Dask uses work stealing: an idle worker should report itself to the Scheduler and steal a task from a busy worker. So what you need to understand is why this isn’t happening in your case. As a wild guess, I would say that Dask is tuned for short tasks by default, so the Scheduler may consider it a waste of time for a worker to steal only one task from another. Maybe this could be changed by tuning the work-stealing feature somehow.
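For example, you could check that work stealing is enabled and how often the Scheduler rebalances tasks. This is only a hedged sketch using keys from distributed’s default configuration; the values shown are illustrative, not a recommendation:

```python
# Hedged sketch: inspect/adjust the scheduler's work-stealing settings.
# These keys come from distributed's default configuration; the values
# below are only illustrative.
import dask

dask.config.set({
    "distributed.scheduler.work-stealing": True,              # stealing is on by default
    "distributed.scheduler.work-stealing-interval": "100ms",  # how often the scheduler rebalances
})
```

Note that, as far as I know, these need to be in place on the Scheduler’s side before it starts (e.g. via distributed.yaml or environment variables), otherwise they won’t take effect.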