Not able to get a new cluster from a Dask-Gateway deployed on a local k8s cluster

I am following the general Dask-Gateway deployment documentation to deploy it on a local k8s cluster hosted via Rancher Desktop.

I have reached the point where the Traefik proxy, controller, and gateway are deployed, and I can connect to the gateway (for example, I can list the Dask clusters).

I am unable, though, to create new clusters via the gateway (currently trying from a Jupyter notebook). In a terminal I can see in the gateway pod’s logs that the POST request for a cluster did come through and that the cluster is running, but the cell in the Jupyter notebook runs forever and never receives a response. Cancelling the cell sends a DELETE request to the gateway API, and the Dask cluster is then closed.

As a result, I am unable to execute any task graphs.

I have the following thoughts and wanted to see if I could get some advice on them:

I had difficulties using the default values.yml file provided in the documentation. Specifically, the images it references don’t seem to exist (at least I didn’t manage to pull them via the docker command). I am currently using the daskgateway/dask-gateway-server and daskgateway/dask-gateway images to spin up the components on the cluster. I notice that those are pretty old (on inspection they use Python 3.8) and I wonder if the issue is due to this. I also see other people use custom images to bring up their resources; please advise if that is recommended.
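For what it’s worth, the chart lets you override the images per component in values.yml. A minimal sketch, assuming the current chart layout; the exact keys, the ghcr.io image names, and the tags are assumptions that should be checked against the chart’s own values.yaml:

```yaml
gateway:
  image:
    name: ghcr.io/dask/dask-gateway-server   # assumed current image location
    tag: "2023.1.1"                          # assumed tag; pick a recent release
  backend:
    image:
      name: ghcr.io/dask/dask-gateway        # image used for scheduler/worker pods
      tag: "2023.1.1"
```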

Secondly, I am using Python 3.9 on my local machine and, I think, Dask version 2023.9.3. I wonder if the issue is that I am requesting a cluster with one version on the client side while an older version is created on the server side.
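A quick way to see whether version skew is even plausible is to compare the version strings: current Dask uses CalVer (e.g. "2023.9.3"), while the old daskgateway/* images ship much older releases. In a live session the server-side versions would come from distributed’s Client.get_versions(); this is just a stdlib-only sketch, and the server version string below is an assumption about what the 3-year-old image contains:

```python
def parse_calver(version: str) -> tuple:
    """Parse a dotted version string like '2023.9.3' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))

client_version = "2023.9.3"  # my local dask
server_version = "0.9.0"     # assumed version inside the old image

if parse_calver(client_version) != parse_calver(server_version):
    print("client/server version mismatch:", client_version, "vs", server_version)
```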

Hi @Rizer-Spiner, welcome to Dask community!

Are you also able to see the Scheduler and Worker pods?

Could you share the values.yaml file you are using, and the code to create the cluster?

This is not normal and should not be the case. These images are 3 years old, and using them will definitely lead to problems. Are you behind a proxy of some sort?

Thank you @guillaumeeb for the fast reply!

My values.yml file is a copy of the file at this link, except that the images for the server part (gateway/scheduler/worker) have been switched to


name: daskgateway/dask-gateway-server
tag: "latest"

To attempt to create a new cluster through the gateway, I use the following code in a Jupyter notebook:


from dask_gateway import Gateway

gateway = Gateway('http://localhost:64696')
cluster = gateway.new_cluster()

, where 64696 is the port I set Rancher to forward traffic to the Traefik proxy, as I haven’t had success connecting to the gateway via the Traefik external IP. I too thought that this specific issue might be a proxy problem; you are correct in assuming I am behind a corporate firewall (maybe that is why I can only access the gateway server using port-forwarding). But I think that is ruled out, since the following method does return (even though it returns an empty array):

gateway.list_clusters()
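When debugging behind a corporate firewall, it can help to separate plain TCP reachability from Dask-level behaviour before involving the Gateway client at all. A stdlib-only sketch; the port 64696 is an assumption taken from the Rancher port-forward above, so substitute your own:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# 64696 is the Rancher-forwarded Traefik port used above (your setup may differ).
print("gateway port reachable:", can_reach("localhost", 64696))
```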

Since then, I found that I could use an already existing Helm chart specifically for dask-gateway and use that to deploy a Dask Gateway. The command I am running right now is:

helm install --version 2023.1.1 myrelease dask/dask-gateway

Compared to the situation described in my previous reply, I can now successfully create clusters, but I now fail to create worker pods (by scaling the cluster). I have provided some screenshots that show my current situation.

However you look at it, I’m kind of stuck :sweat_smile: Thanks in advance for the help!

Cluster view after scaling in jupyter:

Looking at the pods in k8s, the worker ones are in a Pending state and don’t have any logs I can inspect.

Creating a client for the cluster and submitting a graph via the ‘get’ method results in the following error on the scheduler pod:

So you have a running Scheduler pod, could you inspect its log?

Then, could you try some kubectl commands to understand why the worker pods are in a Pending state? This is probably linked to the K8S cluster state, but it’s hard to guess.
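For reference, a Pending pod has no running container yet, so kubectl logs is empty by design; the scheduling reason shows up in the pod’s events instead. A sketch of the usual commands; the dask-gateway namespace and the pod name placeholder are assumptions about your setup:

```shell
# Pending pods have no containers started, so "kubectl logs" returns nothing;
# the scheduling decision is recorded in the Events section instead.
kubectl get pods -n dask-gateway
kubectl describe pod -n dask-gateway <worker-pod-name>   # check Events, e.g. "Insufficient cpu"
kubectl get events -n dask-gateway --sort-by=.metadata.creationTimestamp
```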

Finally, about your client error, are you still using old versions of images for your Dask Gateway pods?

And I can confirm that I’m able to pull the dask-gateway-server Docker image from the GitHub Container Registry.

The scheduler pod’s logs are shown in the screenshot above.

As mentioned, I couldn’t find any logs for the worker pods; the kubectl logs command doesn’t return anything.

As for the images, yes, I am still using the old ones. I will update to the latest and see what happens.

But you are able to see that there are Pending pods for the workers; can you check why they are in a Pending state?

@guillaumeeb thank you for the advice and help!

In the end I managed to resolve my issue: I downloaded the entire dask-gateway chart from the Helm repository instead of overriding the values.yml and using the chart from the web.

This changed my command from

helm upgrade dask-gateway dask-gateway --repo=https://helm.dask.org --install --namespace dask-gateway --values values.yml

To

helm upgrade dask-gateway dask-gateway

where dask-gateway is the name of the local chart folder.
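For anyone landing here later, the local-chart workflow described above would look roughly like this; these are standard Helm commands, and the release name, namespace, and chart version are taken from earlier in the thread rather than confirmed against this exact setup:

```shell
# Fetch the chart once so values edits are made against a local copy.
helm repo add dask https://helm.dask.org
helm repo update
helm pull dask/dask-gateway --version 2023.1.1 --untar   # unpacks into ./dask-gateway/
# Install/upgrade from the local folder rather than the remote repo:
helm upgrade --install dask-gateway ./dask-gateway --namespace dask-gateway --create-namespace
```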

I can use now all expected functionality. Thanks!
