Pleasure to join this community. My name is Ivan, and my team is currently investigating the capabilities of the Dask framework.
We have set up Azure Kubernetes Service (AKS) and deployed Dask using this Helm chart, and everything seems to spin up out of the box.
To perform computations we create a cluster via this script:
    from dask_gateway import Gateway
    import time

    gateway = Gateway()
    num_workers = 64
    cluster = gateway.new_cluster(shutdown_on_close=False)
    cluster.scale(num_workers)
    time.sleep(7)
    client = cluster.get_client()
    client
Everything went well until some error interrupted the cluster creation (I suspect even a blunt kernel interruption during cell execution might have caused it). As a result, a bunch of clusters are now stuck in the Pending state, and we are still figuring out how to remove them.
- We have tried restarting the AKS service - no luck;
- I have dropped the initial 2021.10.0 release and redeployed everything with the most recent one - no luck;
- I have tried to explicitly connect the gateway to one of those clusters and run the cluster.shutdown() command - no luck;
    from dask_gateway import Gateway
    from dask.distributed import Client, progress

    gateway = Gateway()
    cluster_infos = sorted(list(gateway.list_clusters()), key=lambda x: x.status, reverse=True)
    for cluster in cluster_infos:
        print(cluster, cluster.name, sep="\n")
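One variant that may be worth trying is to ask the gateway to stop each cluster by name instead of connecting to it first, since connecting may wait for a pending cluster to start. The following is only a rough sketch, not a confirmed fix: the helper name `stop_pending_clusters` and the `main` wrapper are made up for illustration, and it assumes each entry returned by `gateway.list_clusters()` exposes `name` and a `status` enum whose member name is `PENDING`. Whether the backend actually cleans up the stuck resources in this situation is untested here.

```python
def stop_pending_clusters(gateway, statuses=("PENDING",)):
    """Request a stop for every listed cluster whose status name matches.

    `gateway` is expected to behave like dask_gateway.Gateway: it must
    provide list_clusters() and stop_cluster(name). This helper is
    hypothetical and not part of dask-gateway itself.
    """
    stopped = []
    for report in gateway.list_clusters():
        # Compare by enum member name so the helper does not need to
        # import ClusterStatus (assumption: status has a .name attribute).
        if report.status.name in statuses:
            gateway.stop_cluster(report.name)
            stopped.append(report.name)
    return stopped


def main():
    # Imported here so the helper above can be exercised without a gateway.
    from dask_gateway import Gateway

    gateway = Gateway()
    print(stop_pending_clusters(gateway))
```

Calling `main()` from a notebook cell would print the names of the clusters for which a stop was requested; running the listing code again afterwards shows whether they actually went away.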
Any advice on how to get rid of those pending clusters?
Many thanks in advance.
I am not an expert with k8s, so I am using Lens to check the cluster state.