Hi @jacobtomlinson / @davidp1404 ,
We are seeing a similar issue: when pods are stuck in a Pending state or get removed because of NetworkNotReady or similar problems, the scaling/reconciliation process falls apart and is never retried.
[2023-06-23 10:46:25,279] kopf.objects [DEBUG ] [ns/cluster-name-worker-b6355ac1c0] Something has changed, but we are not interested (the essence is the same).
[2023-06-23 10:46:25,279] kopf.objects [DEBUG ] [ns/cluster-name-worker-b6355ac1c0] Handling cycle is finished, waiting for new changes.
[2023-06-23 10:46:29,976] kopf.objects [DEBUG ] [ns/cluster-name-worker-b6355ac1c0] Deletion, but we are done with it, and we do not care.
[2023-06-23 10:46:29,976] kopf.objects [DEBUG ] [ns/cluster-name-worker-b6355ac1c0] Handling cycle is finished, waiting for new changes.
[2023-06-23 10:46:30,818] kopf.objects [DEBUG ] [ns/cluster-name-worker-b6355ac1c0] Deleted, really deleted, and we are notified.
We use Karpenter to scale the nodes, so provisioning can take a little while (though still within a minute or two). If the nodes are not all ready to place the pods at that point, the Dask Operator simply ignores the scale request.
- The logs above are from a scenario where I was trying to scale from 25 to 35 and then to 45.
- The DaskClusterGroup spec clearly shows 45.
- However, the cluster ends up with 28 working pods (example scenario).
The following is an excerpt from the worker group manifest:
cluster: cluster-name
worker:
  replicas: 45
Is there a timeout or re-check parameter that we can pass to the operator so that it actually reconciles this properly? Do you have any other suggestions for making this scaling reliable?
This behaviour occurs during initialization, before any Dask operations are performed.
We should also mention that we run an auto-termination and dashboard-exposure plugin for dask-operator, but it only overrides the following kopf events:
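To make clear what we mean by a re-check, here is a minimal sketch in plain Python of the comparison we would expect to happen periodically. `workers_to_create` and `LIVE_PHASES` are hypothetical names used only for illustration; they are not part of dask-operator or kopf, and treating Pending pods as still counting toward the target is our assumption.

```python
from collections import Counter
from typing import Iterable

# Assumption: pods in these phases still count toward the desired replicas,
# since Pending pods should start once the nodes become ready.
LIVE_PHASES = {"Pending", "Running"}


def workers_to_create(desired: int, pod_phases: Iterable[str]) -> int:
    """Return how many worker pods are missing relative to the spec."""
    counts = Counter(pod_phases)
    live = sum(counts[phase] for phase in LIVE_PHASES)
    # Never return a negative number; scale-down is out of scope here.
    return max(desired - live, 0)


# Scenario from this report: the spec says 45, but only 28 workers exist.
print(workers_to_create(45, ["Running"] * 28))
```

For the scenario above this reports 17 missing workers, which is the gap we would hope a retry or periodic reconciliation would close.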
@kopf.on.create("service", labels={"dask.org/component": "scheduler"})
@kopf.on.delete("service", labels={"dask.org/component": "scheduler"})
@kopf.on.field("pod", field="status.phase", new="Succeeded", labels={"dask.org/component": "worker"})
Thanks