Hello everyone.
It is a pleasure to join this community. My name is Ivan, and my team is currently investigating the capabilities of the Dask framework.
We have set up Azure Kubernetes Service (AKS) and deployed Dask using this Helm chart, and everything seems to spin up out of the box.
To perform computations we create a cluster via this script:
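from dask_gateway import Gateway
import time

gateway = Gateway()
num_workers = 64
# shutdown_on_close=False: the cluster is not shut down automatically when this object is closed
cluster = gateway.new_cluster(shutdown_on_close=False)
cluster.scale(num_workers)
time.sleep(7)  # brief pause before requesting the client
client = cluster.get_client()
client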
Everything went well until some error interrupted cluster creation (I believe even an abrupt kernel interruption during cell execution could cause this). As a result, a number of clusters are now stuck in the Pending state, and we are still figuring out how to remove them.
- We have tried restarting the AKS service - no luck;
- I have dropped the initial 2021.10.0 release and redeployed everything with the most recent one - no luck;
- I have tried to explicitly connect through the gateway to one of those clusters and run the cluster.shutdown() command - no luck;
from dask_gateway import Gateway
from dask.distributed import Client, progress
gateway = Gateway()
# Sort the cluster reports by status so that running clusters are listed first
cluster_infos = sorted(gateway.list_clusters(), key=lambda x: x.status, reverse=True)
for cluster in cluster_infos:
    print(cluster, cluster.name, sep="\n")
ClusterReport<name=default.695282de9fa344baba31a86885cc1887, status=RUNNING>
default.695282de9fa344baba31a86885cc1887
ClusterReport<name=default.35d06a2119d143e5bc04a00fad405ed3, status=PENDING>
default.35d06a2119d143e5bc04a00fad405ed3
ClusterReport<name=default.90ea366d29174c41a300e329dee197e0, status=PENDING>
default.90ea366d29174c41a300e329dee197e0
ClusterReport<name=default.a9a67ddc94bf4c7383bb2d49bf21c304, status=PENDING>
default.a9a67ddc94bf4c7383bb2d49bf21c304
ClusterReport<name=default.af61aabf03af47ed93e039063fc90350, status=PENDING>
default.af61aabf03af47ed93e039063fc90350
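For reference, a name-based shutdown through the gateway, without connecting to each cluster first, might look like the sketch below. It relies only on `Gateway.list_clusters()` and `Gateway.stop_cluster(name)` from dask_gateway; the helper `stop_pending_clusters` and its `dry_run` flag are hypothetical names made up for illustration. Whether the gateway actually honors the stop request for clusters stuck in PENDING is an assumption and has not been verified on this deployment.

```python
def stop_pending_clusters(gateway, dry_run=True):
    """Return the names of clusters reported as PENDING, stopping them unless dry_run.

    `gateway` is expected to behave like dask_gateway.Gateway, i.e. to provide
    list_clusters() and stop_cluster(name). This helper is a hypothetical sketch.
    """
    selected = []
    for report in gateway.list_clusters():
        # dask_gateway reports status as an enum; fall back to str() otherwise
        status = getattr(report.status, "name", str(report.status))
        if status != "PENDING":
            continue
        if not dry_run:
            # Ask the gateway to stop the cluster by name, without connecting to it
            gateway.stop_cluster(report.name)
        selected.append(report.name)
    return selected
```

With a real gateway this would be called as `stop_pending_clusters(Gateway())` to preview the affected names, and with `dry_run=False` to actually send the stop requests.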
Any advice on how to get rid of those pending clusters?
Many thanks in advance.
P.S.
I am not an expert in k8s, so I am using Lens to check the cluster state.