I have successfully deployed a Dask k8s cluster via the operator helm chart, using the example custom resource (cluster-spec.yml) from the "Migrating from classic" page of the Dask Kubernetes documentation (2024.9.1.dev4+gf30da72). Client code works as expected and everything appears functional.
There is a rough edge, though, that I am trying to figure out. When a cluster is created, the namespace apparently ends up with a finalizer added to it. While experimenting I often blow away and recreate the resources that I deploy, and usually that works fine (I typically use helmsman). In this case, however, namespace deletion hangs on the existence of that finalizer. I have only found two ways to somewhat mitigate this:
- Don't create and delete the custom resource (the example cluster) as a helm hook. Instead, directly kubectl apply it after helmsman creates the operator, and directly kubectl delete it before having helmsman destroy the operator.
- Use kubectl to manually patch the finalizer away when namespace deletion hangs (see the sketch just below).
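For the second workaround, this is roughly the patch I mean, assuming the stuck finalizer is on the DaskCluster itself (named example, as in the kubectl output further down; substitute your own namespace). Emptying the finalizer list lets deletion proceed regardless of the finalizer's exact name:

$ kubectl patch daskcluster example -n <namespace> --type merge -p '{"metadata":{"finalizers":null}}'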
The only other data point I can offer is that if I examine the custom resource, it always has a status of "Pending", which might relate in part to helmsman/helm not handling the postInstall and preDelete hooks completely correctly. However, as described above, I've also seen this when splitting the cluster creation out to kubectl, so right now I'm inclined to suspect that anything related to helmsman or helm is a red herring, and that the core issue lies with the custom resource, the CRD, or perhaps the operator.
$ kubectl get daskclusters.kubernetes.dask.org/example
NAME      WORKERS   STATUS    AGE
example   10        Pending   17s
This leads me to wonder whether this stuck state has something to do with how the custom resource, or the CRD itself, deals with that painful historical mishmash in k8s around readiness gates vs. status conditions. Maybe the example cluster specification, the CRD, or both need something extra to help Kubernetes understand that the cluster deployment succeeded?
Any insights appreciated.
It sounds like you don’t have the dask-kubernetes operator controller running, at least some of the time.
When you create a DaskCluster resource the controller springs into action and creates all the necessary Pods for the scheduler, workers, etc. Part of the controller's workflow moves the cluster status from Created to Pending to Running.
Then when you delete the DaskCluster the controller handles a couple of cleanup tasks. The controller adds the finalizer to the DaskCluster to ensure that the state doesn't get blown away until the cleanup task has completed, then it removes the finalizer.
If a namespace is being deleted it needs to wait for all resources to have their finalizers removed before it can complete.
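As an aside, when a namespace does get stuck in Terminating, a quick way to find the resource holding it up is to list everything still present in it. A rough sketch (substitute your namespace name):

$ kubectl api-resources --verbs=list --namespaced -o name \
    | xargs -n 1 kubectl get -n <namespace> --ignore-not-found --show-kind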
If you could share the exact steps you are following it would help to troubleshoot what's going on. I suspect you are deleting the operator before deleting the DaskCluster, which means the finalizer task never gets completed and the namespace deletion enters a deadlock.
One thing is definite: "Pending" never transitions to "Running". The cluster is fully functional as far as I can tell (I have some test notebooks doing various dask array and dask-ml jobs to exercise it).
The rest of what you said definitely lines up: if I manually kubectl apply the DaskCluster after applying the operator via the chart, and later manually delete the DaskCluster before destroying the operator via the chart, then the issue with namespace deletion hanging does not occur (sketched below).
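That is, roughly this ordering works (file names as in my helmsman config further down):

$ helmsman --apply -f my-dask-operator.yaml       # install the operator chart
$ kubectl apply -f hooks/dask-op-cluster.yaml     # then create the DaskCluster
# ...exercise the cluster...
$ kubectl delete -f hooks/dask-op-cluster.yaml    # delete the DaskCluster first
$ helmsman --destroy -f my-dask-operator.yaml     # then remove the operator
$ kubectl delete ns dask-op                       # the namespace now deletes cleanly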
Some added specifics.
- Installation is via the 2024.9.0 dask-kubernetes-operator helm chart, applied with
  helmsman --apply -f my-dask-operator.yaml
  which contains (just extracting the relevant part from a larger deployment):
  settings:
    kubeContext: 'microk8s'
  namespaces:
    dask-op:
      protected: false
  apps:
    dask-op:
      namespace: 'dask-op'
      chart: '../../charts/dask-kubernetes-operator'
      version: '2024.9.0'
      valuesFile: 'overrides/dask-op.yaml'
      enabled: true
      wait: false
      # hooks:
      #   postInstall: 'hooks/dask-op-cluster.yaml'
      #   preDelete: 'hooks/dask-op-cluster.yaml'
- In the YAML above, the overrides reference simply overrides the chart's values.yaml to temporarily disable the service account and RBAC while I'm experimenting, nothing else, just so I could rule those out as a factor in anything going on:
  serviceAccount:
    create: false
  rbac:
    create: false
    cluster: false
- The hooks that are currently commented out are the ones that would apply and delete the example cluster. When those aren't commented out is when I experience the issue after doing helmsman --destroy -f my-dask-operator.yaml: Helmsman completes, but if I later try kubectl delete ns dask-op, the deletion hangs. I mostly clean up namespaces because, while Helmsman creates them, it doesn't clean up after itself by removing them again. When exercising automation I want a clean situation on each run, so I know any problems are intrinsic to what is being executed now, not cruft left over from some past attempt.
- The YAML for the DaskCluster is a minor variation on what is shown below, mostly just experimenting with different images for the worker/scheduler and commenting out the pulling of a different version of distributed from github/main (which seems a strange thing to have in the example, for a system that is so fussy about matching library versions). Note that the restartPolicy in the example cluster on the Dask website is incorrect; Kubernetes rejects it because this is not a Pod definition, hence why I comment it out. I've also tweaked it to use a NodePort, as that made testing via a remote client easier (quick check after the YAML below).
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: example
  # labels:
  #   foo: bar
spec:
  worker:
    replicas: 10
    spec:
      # restartPolicy: Never
      containers:
        - name: worker
          image: 'ghcr.io/dask/dask:latest'
          imagePullPolicy: 'Always'
          args: [dask-worker, --nthreads, '2', --no-dashboard, --memory-limit, 10GB, --death-timeout, '60', '--name', $(DASK_WORKER_NAME)]
          # env:
          #   - name: EXTRA_PIP_PACKAGES
          #     value: git+https://github.com/dask/distributed
          resources:
            limits:
              cpu: '2'
              memory: 6G
            requests:
              cpu: '2'
              memory: 6G
  scheduler:
    spec:
      containers:
        - name: scheduler
          image: 'ghcr.io/dask/dask:latest'
          imagePullPolicy: 'Always'
          args:
            - dask-scheduler
          resources:
            limits:
              cpu: '2'
              memory: 8G
            requests:
              cpu: '2'
              memory: 8G
          ports:
            - name: tcp-comm
              containerPort: 8786
              protocol: TCP
            - name: http-dashboard
              containerPort: 8787
              protocol: TCP
          readinessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              port: http-dashboard
              path: /health
            initialDelaySeconds: 15
            periodSeconds: 20
    service:
      type: NodePort
      selector:
        dask.org/cluster-name: example
        dask.org/component: scheduler
      ports:
        - name: tcp-comm
          protocol: TCP
          port: 8786
          targetPort: 'tcp-comm'
          nodePort: 32200
        - name: http-dashboard
          protocol: TCP
          port: 8787
          targetPort: 'http-dashboard'
          nodePort: 32201
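The quick NodePort check from outside the cluster is then just the following, using the ports configured above (substitute the address of your microk8s node for <node-ip>):

$ curl http://<node-ip>:32201/health    # the dashboard health endpoint the probes also hit
# the remote client connects to the scheduler comm at tcp://<node-ip>:32200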
So to wrap up: using the postInstall and preDelete hooks is when I experience the problem of a namespace not being deletable. They should fire after operator creation and before operator deletion. I had wondered whether this was partly down to not being able to tell helmsman/helm what the final successful status was for the hook's successCondition. Thanks to your response I now know it is Running (the CRD does not specify states, which I believe is an option, although my grasp of CRD subtleties is limited). But that still raises the question: how do I have a perfectly functional cluster in all other respects, yet with its Kubernetes status stuck at Pending?
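For reference, the raw status the controller reports can be read back directly, which is presumably what any hook successCondition would have to key off (I'm not assuming any particular field layout here):

$ kubectl get daskclusters.kubernetes.dask.org/example -n dask-op -o jsonpath='{.status}'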
Another wrinkle, I guess, is that, as I understand it, objects in Kubernetes can reference a parent object (an owner reference) so that deletion of the parent automatically deletes the child. That would be something under the control of the operator, I think.
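As a purely hypothetical check, I believe you can see which of the created objects already carry such a reference, and to what, with something like:

$ kubectl get pods -n dask-op -o custom-columns='NAME:.metadata.name,OWNER-KIND:.metadata.ownerReferences[*].kind'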
Ok this is making more sense. Your pattern of installing the operator and the Dask cluster as a single operation is definitely unusual. Usually folks install the operator at some point in time, and then create one or more Dask clusters as and when they need them.
It definitely sounds like we have a race condition when you delete a DaskCluster and then the operator in quick succession. I suspect we may be able to resolve this by setting up a parent reference between the DaskCluster resource and the controller Deployment. Would you mind opening an issue on GitHub to track this?
The fact that your cluster always remains in Pending concerns me. Could you share the logs from the controller Pod so we can see what is causing this?
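Something along these lines should grab them, assuming the chart's default labels (the selector may differ depending on your release name and values overrides):

$ kubectl logs -n dask-op -l app.kubernetes.io/name=dask-kubernetes-operator --tail=200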
I will do so, probably sometime this evening. Thanks for looking into it.
And yes, admittedly, I tend to kick automation tires harder than may be the norm. I do a fair bit of R&D experimentation on different technologies so spinning them up and yanking them back down again to assess different configurations is part of my workflow.