I have successfully deployed a Dask k8s cluster via the operator helm chart, using the example custom resource (cluster-spec.yml) from the "Migrating from classic" page of the Dask Kubernetes 2024.9.1.dev4+gf30da72 documentation. Client code works as expected and everything appears functional.
There is a rough edge, though, that I am trying to figure out. When creating a cluster, the namespace apparently ends up with a finalizer added to it. While experimenting I often blow away and recreate the resources that I deploy, and usually that works fine (I typically use helmsman). In this case, however, namespace deletion hangs because of that finalizer. I have only found two ways to somewhat mitigate this:
- Don’t create and delete the custom resource (the example cluster) as a helm hook. Instead, run kubectl apply directly after helmsman creates the operator, and conversely run kubectl delete directly before having helmsman destroy it.
- Use kubectl to manually patch the namespace and remove the finalizer when namespace deletion hangs.
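For anyone hitting the same hang, the second workaround can be sketched roughly as below. This is a minimal sketch, not necessarily the exact command I ran: it assumes the stuck finalizer lives in the namespace's `metadata.finalizers` list, the namespace name you pass in (for example `dask`) is a placeholder, and the `KUBECTL` variable is only there so the binary can be swapped out. Note that clearing finalizers skips whatever cleanup their owner intended to do.

```shell
#!/bin/sh
# Sketch only: the KUBECTL override and function names are illustrative.
KUBECTL="${KUBECTL:-kubectl}"

# Print the finalizers currently listed in the namespace metadata.
show_ns_finalizers() {
    "$KUBECTL" get namespace "$1" -o jsonpath='{.metadata.finalizers}'
}

# Clear metadata.finalizers with a JSON merge patch so deletion can proceed.
# Assumption: the blocking entry is in metadata.finalizers, not spec.finalizers.
clear_ns_finalizers() {
    "$KUBECTL" patch namespace "$1" --type=merge -p '{"metadata":{"finalizers":null}}'
}
```

Usage would be something like `show_ns_finalizers dask` to see what is blocking, then `clear_ns_finalizers dask` once the namespace is stuck in Terminating.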
The only other data point I can offer is that if I examine the custom resource, it always has a status of “Pending”, which might relate in part to helmsman/helm not handling postInstall and preDelete hooks entirely correctly. However, as described above, I’ve also experienced the issue after splitting the cluster creation out to kubectl, so right now I’m inclined to suspect that anything related to helmsman or helm is a red herring, and that the core issue relates to the custom resource, the CRD, or perhaps the operator.
$ kubectl get daskclusters.kubernetes.dask.org/example
NAME      WORKERS   STATUS    AGE
example   10        Pending   17s
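To dig past the one-line summary, something like the following could show what is actually recorded on the resource. This is a hedged sketch: it only uses standard `kubectl get ... -o jsonpath` queries, the function names and the `KUBECTL` override are illustrative, and I am assuming the operator records anything useful under `.status` at all.

```shell
#!/bin/sh
# Sketch only: the KUBECTL override and function names are illustrative.
KUBECTL="${KUBECTL:-kubectl}"

# Dump the full status block recorded on a DaskCluster resource.
show_cluster_status() {
    "$KUBECTL" get daskclusters.kubernetes.dask.org "$1" -o jsonpath='{.status}'
}

# List finalizers on the DaskCluster itself; a leftover entry here
# would also keep the enclosing namespace from being deleted.
show_cluster_finalizers() {
    "$KUBECTL" get daskclusters.kubernetes.dask.org "$1" -o jsonpath='{.metadata.finalizers}'
}
```

For the cluster above that would be `show_cluster_status example` and `show_cluster_finalizers example`, run in the namespace where the resource lives.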
This leads me to wonder if the source of this issue of getting stuck has something to do with how the custom resource, or the CRD itself, deals with that painful historical mishmash in k8s related to readiness gates vs status conditions. Maybe either the example cluster specification, or the CRD, or both, need something to help Kubernetes understand that cluster deployment succeeded?
Any insights appreciated.