[AKS based cluster with Helm] - Cannot remove PENDING clusters

Hello everyone.
Pleasure to join this community. My name is Ivan, and my team is currently investigating the capabilities of the Dask framework.

We have set up Azure Kubernetes Service (AKS) and deployed Dask using this helm chart, and everything seems to spin up out of the box.

To perform computations we create a cluster via this script:

from dask_gateway import Gateway

gateway = Gateway()

num_workers = 64

# keep the cluster alive after the notebook kernel exits
cluster = gateway.new_cluster(shutdown_on_close=False)
client = cluster.get_client()

Everything went well until some error interrupted cluster creation (I believe even a blunt kernel interruption during cell execution might be the cause). A bunch of clusters are now stuck in the Pending state, and we are still figuring out how to remove them.

  • We have tried restarting the AKS service - no luck;
  • I dropped the initial 2021.10.0 release and redeployed everything with the most recent one - no luck;
  • I tried explicitly connecting the gateway to one of those clusters and running the cluster.shutdown() command - no luck:
from dask_gateway import Gateway

gateway = Gateway()

# list all known clusters, RUNNING first
cluster_infos = sorted(gateway.list_clusters(), key=lambda x: x.status, reverse=True)

for cluster in cluster_infos:
    print(cluster, cluster.name, sep="\n")

ClusterReport<name=default.695282de9fa344baba31a86885cc1887, status=RUNNING>
ClusterReport<name=default.35d06a2119d143e5bc04a00fad405ed3, status=PENDING>
ClusterReport<name=default.90ea366d29174c41a300e329dee197e0, status=PENDING>
ClusterReport<name=default.a9a67ddc94bf4c7383bb2d49bf21c304, status=PENDING>
ClusterReport<name=default.af61aabf03af47ed93e039063fc90350, status=PENDING>
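For completeness, here is how the PENDING names can be pulled out of those reports programmatically. This is a standalone sketch: the `ClusterReport` namedtuple below is just a stand-in for the report objects `gateway.list_clusters()` returns, which expose the same `.name` and `.status` attributes.

```python
from collections import namedtuple

# Stand-in for the report objects returned by gateway.list_clusters()
ClusterReport = namedtuple("ClusterReport", ["name", "status"])

def pending_cluster_names(reports):
    """Return the names of clusters stuck in PENDING state."""
    return [r.name for r in reports if str(r.status).endswith("PENDING")]

reports = [
    ClusterReport("default.695282de9fa344baba31a86885cc1887", "RUNNING"),
    ClusterReport("default.35d06a2119d143e5bc04a00fad405ed3", "PENDING"),
]
print(pending_cluster_names(reports))
# ['default.35d06a2119d143e5bc04a00fad405ed3']
```

Against a live gateway, another thing worth trying on each of those names is `gateway.stop_cluster(name)`, which requests a shutdown by name rather than through a connected cluster object.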

Any advice on how to get rid of those pending clusters?

Many thanks in advance.

I am not an expert with k8s, so I am using Lens to check the cluster state.

Hi @Pirognoe and welcome to Discourse! Thanks for providing the reproducible snippets and what you’ve already tried; these details are very helpful. I’m no Dask Gateway expert, but I think @jcrist might be able to shed some light on this?

1 Like

I’m surprised that calling cluster.shutdown() didn’t work. That should successfully transition the cluster into the stopped state. Does that call error? Or succeed but nothing changes? If it succeeds but nothing changes, I’d check for issues in the dask-gateway controller pod logs - that’s the service responsible for transitioning clusters from state to state.

If you don’t care about debugging what’s going on, you can also manually delete the DaskCluster objects with kubectl. I haven’t used kubectl in over a year, but iirc you’d want to do something like the following:

kubectl delete daskcluster af61aabf03af47ed93e039063fc90350
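One wrinkle worth noting: the gateway reports names as default.&lt;hash&gt;, while the command above uses the bare hash as the DaskCluster object name. Assuming that naming scheme holds in your deployment, you would strip the default. profile prefix first - a quick shell sketch (the namespace placeholder is hypothetical; use whatever namespace your helm release lives in):

name="default.af61aabf03af47ed93e039063fc90350"

# strip the "default." profile prefix to get the object name
echo "${name#default.}"   # af61aabf03af47ed93e039063fc90350

# then, for example:
# kubectl delete daskcluster "${name#default.}" -n <gateway-namespace>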

Thanks, the code snippet helped!

1 Like