Dask and distributed versions: 2024.11.2
My autoscaler:
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: dask-primary
spec:
  cluster: dask-primary
  minimum: 1  # tried 2 as well
  maximum: 5
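For reference, this is what I understand the DaskAutoscaler resource to be equivalent to programmatically (a minimal sketch assuming the dask_kubernetes.operator API; my cluster is actually created declaratively):

from dask_kubernetes.operator import KubeCluster

# Attach to the existing operator-managed cluster and enable adaptive scaling.
cluster = KubeCluster.from_name("dask-primary")
cluster.adapt(minimum=1, maximum=5)  # mirrors the DaskAutoscaler spec above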
I have a long-running task limited by a resource: the workers are started with --resources DOCLING=1 (set as a flag on the worker command in k8s).
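To confirm the resource is actually registered, I check the scheduler's view of the workers (a sketch; I'm assuming the "resources" key in scheduler_info() is the right place to look, and the scheduler address is just illustrative):

from distributed import Client

client = Client("tcp://dask-primary-scheduler:8786")  # illustrative address

# Each worker should advertise DOCLING=1 to the scheduler.
for addr, info in client.scheduler_info()["workers"].items():
    print(addr, info.get("resources"))  # I expect {'DOCLING': 1} on every worker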
I run my task as part of a graph, inside a delayed function:

from distributed import worker_client

with worker_client() as client:
    converted = client.submit(
        docling_run_submitted_pipeline,
        pipeline,
        data,
        resources={"DOCLING": 1},
    ).result()
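For context, the surrounding structure looks roughly like this (the names and the outer function are simplified placeholders, not my real pipeline code; only the submit pattern matches what I actually run):

import dask
from distributed import worker_client

@dask.delayed
def convert_document(pipeline, data):
    # Placeholder for the real delayed function in my graph.
    with worker_client() as client:
        return client.submit(
            docling_run_submitted_pipeline,  # the slow, DOCLING-limited step
            pipeline,
            data,
            resources={"DOCLING": 1},
        ).result()

# results = dask.compute(*[convert_document(pipeline, d) for d in documents])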
Dask correctly limits parallel execution of the resource-limited task to 1 per worker. However, the dashboard shows that workers hold several of these limited tasks in the “processing” state, just sitting there waiting. The task that does run is very slow, and I expected the scheduler to spawn a new worker in this situation, but the cluster stays at the minimum number of workers. New workers are only spawned when I have hundreds of small tasks, never to unblock this resource.
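One thing I plan to check is what worker count the scheduler itself is asking for; I'm assuming the autoscaler ultimately relies on Scheduler.adaptive_target, so a sketch like this should show whether the scheduler even wants more workers while the DOCLING tasks are queued up:

from distributed import Client

client = Client("tcp://dask-primary-scheduler:8786")  # illustrative address

# Debugging sketch (not yet tried): ask the scheduler for its currently
# desired number of workers.
def desired_workers(dask_scheduler):
    return dask_scheduler.adaptive_target()

print(client.run_on_scheduler(desired_workers))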
Even when the minimum is 2 workers, one can stay idle with 0 tasks for a couple of minutes while the other holds these limited tasks. It looks like work stealing isn’t happening.
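When that happens, this is roughly what I look at to confirm the imbalance (a sketch; I'm assuming work stealing is governed by the distributed.scheduler.work-stealing key and is on by default):

import dask
from distributed import Client

client = Client("tcp://dask-primary-scheduler:8786")  # illustrative address

# Which tasks each worker currently has in "processing" -- this is where I see
# one worker holding several DOCLING tasks while the other holds none.
print(client.processing())

# Sanity check that work stealing hasn't been disabled somewhere.
print(dask.config.get("distributed.scheduler.work-stealing"))  # I expect True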
I’ve read that Dask learns task durations over time, but that doesn’t seem to help here (though I’m not sure how long the measurement takes; I’ve tested 10-20 of these tasks over a span of ~1 hour and no scaling ever happens).
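One idea I'm considering, in case the scheduler underestimates these tasks before it has measured them: declaring a default duration for this task prefix up front. As far as I can tell the key is distributed.scheduler.default-task-durations, and the 10-minute value below is just a guess:

import dask

# Hint that this task family is slow instead of waiting for the scheduler
# to learn its duration. This would need to go into the scheduler's
# configuration; shown here only to illustrate the key.
dask.config.set({
    "distributed.scheduler.default-task-durations": {
        "docling_run_submitted_pipeline": "10 minutes",  # guessed value
    }
})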
I only have a few settings overrides:
- name: DASK_DISTRIBUTED__SCHEDULER__WORKER_TTL
  value: 2m   # 5m
- name: DASK_DISTRIBUTED__WORKER__LIFETIME__DURATION
  value: 30m  # None
- name: DASK_DISTRIBUTED__WORKER__LIFETIME__STAGGER
  value: 60s  # None
I have not yet experimented with distributed.scheduler.worker-saturation and distributed.scheduler.unknown-task-duration.
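For the record, these are the keys I would try next (values are guesses, not recommendations, and I assume they need to be set in the scheduler's environment rather than on the client):

import dask

# Settings I plan to experiment with. Values are guesses.
dask.config.set({
    "distributed.scheduler.worker-saturation": 1.0,              # default is 1.1, I believe
    "distributed.scheduler.unknown-task-duration": "5 minutes",  # default is 500ms, I believe
})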