Dask and distributed versions: 2024.11.2
My autoscaler:
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: dask-primary
spec:
  cluster: dask-primary
  minimum: 1  # tried 2 as well
  maximum: 5
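For reference, this is what I understand the DaskAutoscaler resource to be equivalent to programmatically (a minimal sketch assuming the dask_kubernetes.operator API; my cluster is actually created declaratively):

from dask_kubernetes.operator import KubeCluster

# Attach to the existing operator-managed cluster and enable adaptive scaling.
cluster = KubeCluster.from_name("dask-primary")
cluster.adapt(minimum=1, maximum=5)  # mirrors the DaskAutoscaler spec above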
I have a long-running task limited by a resource: the workers are started with --resources DOCLING=1 (set as a flag on the worker command in k8s).
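To confirm the resource is actually registered, I check the scheduler's view of the workers (a sketch; I'm assuming the "resources" key in scheduler_info() is the right place to look, and the scheduler address is just illustrative):

from distributed import Client

client = Client("tcp://dask-primary-scheduler:8786")  # illustrative address

# Each worker should advertise DOCLING=1 to the scheduler.
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info.get("resources"))  # I expect {'DOCLING': 1} on every worker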
I run my task as part of a graph, inside a delayed function:

from distributed import worker_client

with worker_client() as client:
    converted = client.submit(
        docling_run_submitted_pipeline,
        pipeline,
        data,
        resources={"DOCLING": 1},
    ).result()
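For context, the surrounding structure looks roughly like this (the names and the outer function are simplified placeholders, not my real pipeline code; only the submit pattern matches what I actually run):

import dask
from distributed import worker_client

@dask.delayed
def convert_document(pipeline, data):
    # Placeholder for the real delayed function in my graph.
    with worker_client() as client:
        return client.submit(
            docling_run_submitted_pipeline,  # the slow, DOCLING-limited step
            pipeline,
            data,
            resources={"DOCLING": 1},
        ).result()

# results = dask.compute(*[convert_document(pipeline, d) for d in documents])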
Dask correctly limits parallel execution of the resource-limited task to 1 per worker. However, the dashboard shows that workers hold several of these limited tasks in the “processing” state, just sitting there waiting. The task that does run is very slow, and I expected the scheduler to spawn a new worker in this situation, but the cluster stays at the minimum number of workers. New workers are only spawned when I have hundreds of small tasks, never to unblock this resource.
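One thing I plan to check is what worker count the scheduler itself is asking for; I'm assuming the autoscaler ultimately relies on Scheduler.adaptive_target, so a sketch like this should show whether the scheduler even wants more workers while the DOCLING tasks are queued up:

from distributed import Client

client = Client("tcp://dask-primary-scheduler:8786")  # illustrative address

# Debugging sketch (not yet tried): ask the scheduler for its currently
# desired number of workers.
def desired_workers(dask_scheduler):
    return dask_scheduler.adaptive_target()

print(client.run_on_scheduler(desired_workers))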
Even when the minimum is 2 workers, one can stay idle with 0 tasks for a couple of minutes while the other holds these limited tasks. It looks like work stealing isn’t happening.
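When that happens, this is roughly what I look at to confirm the imbalance (a sketch; I'm assuming work stealing is governed by the distributed.scheduler.work-stealing key and is on by default):

import dask
from distributed import Client

client = Client("tcp://dask-primary-scheduler:8786")  # illustrative address

# Which tasks each worker currently has in "processing" -- this is where I see
# one worker holding several DOCLING tasks while the other holds none.
print(client.processing())

# Sanity check that work stealing hasn't been disabled somewhere.
print(dask.config.get("distributed.scheduler.work-stealing"))  # I expect True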
I’ve read that Dask learns task durations over time, but that doesn’t seem to help here (though I’m not sure how long the measurement takes; I’ve tested 10-20 of these tasks over a span of ~1 hour and no scaling ever happens).
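One idea I'm considering, in case the scheduler underestimates these tasks before it has measured them: declaring a default duration for this task prefix up front. As far as I can tell the key is distributed.scheduler.default-task-durations, and the 10-minute value below is just a guess:

import dask

# Hint that this task family is slow instead of waiting for the scheduler
# to learn its duration. This would need to go into the scheduler's
# configuration; shown here only to illustrate the key.
dask.config.set({
    "distributed.scheduler.default-task-durations": {
        "docling_run_submitted_pipeline": "10 minutes",  # guessed value
    }
})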
I only have a few settings overrides:
- name: DASK_DISTRIBUTED__SCHEDULER__WORKER_TTL
  value: 2m   # 5m
- name: DASK_DISTRIBUTED__WORKER__LIFETIME__DURATION
  value: 30m  # None
- name: DASK_DISTRIBUTED__WORKER__LIFETIME__STAGGER
  value: 60s  # None
I have not yet experimented with distributed.scheduler.worker-saturation and distributed.scheduler.unknown-task-duration.
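For the record, these are the keys I would try next (values are guesses, not recommendations, and I assume they need to be set in the scheduler's environment rather than on the client):

import dask

# Settings I plan to experiment with. Values are guesses.
dask.config.set({
    "distributed.scheduler.worker-saturation": 1.0,              # default is 1.1, I believe
    "distributed.scheduler.unknown-task-duration": "5 minutes",  # default is 500ms, I believe
})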