Dask-kubernetes cluster (custom resource) in Pending

I have successfully deployed a Dask k8s cluster via the operator helm chart, using the example custom resource (cluster-spec.yml) from the “Migrating from classic” page of the Dask Kubernetes 2024.9.1.dev4+gf30da72 documentation. Client code works as expected and everything appears functional.

There is a rough edge, though, that I am trying to figure out. When a cluster is created, a finalizer apparently ends up getting added within the namespace. While experimenting I often blow away and recreate the resources that I deploy, and usually that works fine (I typically use helmsman). In this case, however, namespace deletion hangs on the existence of that finalizer. I have only found two ways to somewhat mitigate this:

  1. Don’t create and delete the custom resource (the example cluster) as a helm hook. Instead, run kubectl apply directly after helmsman creates the operator, and conversely run kubectl delete directly before having helmsman destroy it.

  2. When namespace deletion hangs, use kubectl to patch the resource and remove the finalizer manually, roughly as sketched below.
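
For concreteness, the two workarounds look roughly like the following; example is the cluster name and dask is just a stand-in for whatever namespace the cluster lives in.

$ # Workaround 1: manage the example cluster outside of helm hooks
$ kubectl apply -f cluster-spec.yml -n dask    # after helmsman has installed the operator
$ kubectl delete -f cluster-spec.yml -n dask   # before helmsman tears the operator down

$ # Workaround 2: if namespace deletion is already stuck, clear the finalizer by hand
$ kubectl patch daskclusters.kubernetes.dask.org example -n dask \
    --type merge -p '{"metadata":{"finalizers":null}}'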

The only other datapoint I can offer is that if I examine the custom resource, it always has a status of “Pending”, which might relate in part to helmsman/helm not handling postInstall and preDelete hooks completely correctly. However, as described above, I have also experienced the issue when splitting the cluster creation out to kubectl, so right now I am inclined to suspect that anything related to helmsman or helm is a red herring, and that the core issue relates to the custom resource, the CRD, or perhaps the operator.

$ kubectl get daskclusters.kubernetes.dask.org/example
NAME      WORKERS   STATUS    AGE
example   10        Pending   17s

This leads me to wonder whether the source of the hang has something to do with how the custom resource, or the CRD itself, deals with that painful historical mishmash in k8s around readiness gates versus status conditions. Maybe the example cluster specification, the CRD, or both need something extra to help Kubernetes understand that cluster deployment succeeded?
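For reference, the full status subresource can be dumped like this (again, example being the cluster name and dask my namespace), in case the phase versus conditions distinction matters here:

$ # Dump whatever status the operator has written to the resource
$ kubectl get daskclusters.kubernetes.dask.org example -n dask -o jsonpath='{.status}'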

Any insights appreciated.

It sounds like you don’t have the dask-kubernetes operator controller running, at least some of the time.

When you create a DaskCluster resource the controller springs into action and creates all the necessary Pods for the scheduler, workers, etc. Part of the controller’s workflow moves the cluster status from Created to Pending to Running.
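A quick way to confirm the controller is actually running and reconciling your cluster is to check the operator Pods and follow their logs while you create or delete the DaskCluster. The label and namespace below are assumptions based on a default install of the operator helm chart, so adjust them to match your release:

$ # Is the operator controller actually running?
$ kubectl get pods -A -l app.kubernetes.io/name=dask-kubernetes-operator

$ # Watch what it does while the DaskCluster is created or deleted
$ kubectl logs -n <operator-namespace> -l app.kubernetes.io/name=dask-kubernetes-operator -f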

Then, when you delete the DaskCluster, the controller handles a couple of cleanup tasks. The controller adds the finalizer to the DaskCluster to ensure the resource doesn’t get blown away until those cleanup tasks have completed, and only then removes the finalizer.

If a namespace is being deleted, it needs to wait for all of its resources to have their finalizers removed before the deletion can complete.
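You can usually see exactly what is blocking the namespace by looking at its status conditions and checking which resources still carry finalizers, for example:

$ # Why is namespace deletion stuck?
$ kubectl get namespace <namespace> -o jsonpath='{.status.conditions}'

$ # Does the DaskCluster still have its finalizer attached?
$ kubectl get daskclusters.kubernetes.dask.org example -n <namespace> -o jsonpath='{.metadata.finalizers}'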

If you could share the exact steps you are following, it would help to troubleshoot what’s going on. I suspect you are deleting the operator before deleting the DaskCluster, which means the finalizer’s cleanup task never completes and the namespace deletion enters a deadlock.
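
If that is what is happening, the fix is to tear things down in the reverse order of creation: delete the DaskCluster and wait for it to actually disappear while the operator is still running, and only then remove the operator and the namespace. Something along these lines (the release and namespace names are just placeholders):

$ # 1. Delete the cluster while the operator can still run its cleanup
$ kubectl delete daskclusters.kubernetes.dask.org example -n <namespace> --wait

$ # 2. Only then remove the operator and the namespace
$ helm uninstall <operator-release> -n <operator-namespace>
$ kubectl delete namespace <namespace>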