We’ve been battling these timeout issues for quite some time and could use some guidance. In our troubleshooting efforts we’ve scaled back to a fresh Kubernetes install with no ingress controllers, just a bare vanilla install. We deploy DaskHub with the following command:
helm upgrade --wait --install --render-subchart-notes dhub dask/daskhub --create-namespace --namespace=dhub
Everything spins up. We log in, can create a cluster, and can view the cluster link, but when we try to actually connect the client it times out.
The Python code we ran:
from dask_gateway import GatewayCluster
cluster = GatewayCluster() # connect to Gateway
cluster.adapt(minimum=2, maximum=10) # scale cluster
client = cluster.get_client() # connect Client to Cluster
There’s a lot in the error from get_client(). I can post the whole thing if it helps, but the final exception is:
OSError: Timed out trying to connect to gateway://traefik-dhub-dask-gateway.dhub:80/dhub.934c8cce03d747109e310baf0671e751 after 30 s
At one point we had this working on our production JupyterHub, but we hit these client timeout issues a while back and have not been able to fix them. We’ve checked that our Dask versions are the same across the board, ensured there are no network policies blocking traffic, etc.
Why isn’t a plain vanilla DaskHub installation working on a fresh new Kubernetes cluster? Why are we timing out, and how do we fix it across installations?
Hi @NickCote, welcome to Dask Discourse forum!
It’s hard to tell what could be the problem… I just did a DaskHub deployment on GKE yesterday and everything worked fine with the latest version.
Which DaskHub version are you using? Which Kubernetes version, and what type of Kubernetes (on premise?)
Are you using any values.yaml file to customize your deployment?
Are you able to display the Scheduler Dashboard?
Do you see Worker pods being created?
Did you check the Dask Gateway pod logs?
It’s weird you are able to create a cluster but not connect a Client to it…
We are using the latest DaskHub Helm Chart, 2024.1.0. We had tried with the previous release as well and hit the same issue.
Kubernetes version is 1.28.5 and it was deployed with kubespray on top of Alma Linux on prem.
For this test we are not using any values.yaml file. We have another deployment with this issue that does use one, but for the reproduction we scaled back to the basics. This test was run on new clean hardware with a clean K8s install.
We are able to view the Scheduler dashboard when displaying just the cluster information. We see the worker pods come up as well. The gateway API and Traefik logs don’t provide any information on the timeout. We see requests come in to create the cluster, which succeeds, but when connecting the client there’s no information to track where this is breaking down. If we close the cluster we see that request come in as well. The Jupyter user pod logs only include the Python errors. We have not been able to find anything in the different pod logs showing where this traffic is being dropped or denied.
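Since the logs give no hint about where the client traffic stops, one low-level check that can be run from the Jupyter user pod is a plain TCP connection to the address in the error message. The following is a minimal sketch using only the standard library; `can_connect` is a hypothetical helper name, and the host and port are copied from the OSError above. A successful connection would only suggest that the Traefik proxy is reachable from the notebook pod. It does not exercise the gateway:// protocol itself, so a failure further along the route to the scheduler would not show up here.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> tuple[bool, str]:
    """Try to open a plain TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.gaierror as exc:
        # The service name could not be resolved (for example a DNS problem).
        return False, f"DNS lookup failed: {exc}"
    except socket.timeout:
        # No answer at all, which can point at dropped packets.
        return False, "timed out"
    except OSError as exc:
        # Refused, unreachable, and similar errors.
        return False, f"connection error: {exc}"


if __name__ == "__main__":
    # Host and port taken from the error message in this thread.
    print(can_connect("traefik-dhub-dask-gateway.dhub", 80))
```

Running the same check from a worker or scheduler pod, if Python is available there, might help narrow down whether the problem is specific to traffic leaving the user pod.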
I have no idea what could cause this… Could you give us the full stack trace, even though I’m not sure it will help much?
I’ve been trying to find the best places to troubleshoot this further. We have a production DaskHub system deployed that was working fine and then started hitting these timeouts, which lasted for a few months. We then recreated the setup on a new bare metal cluster to rule out variables, and that had the same timeout errors.
I was able to deploy on a minikube cluster and had everything work. We also went ahead and upgraded the k8s version on our production cluster, and after that everything is now “working” on our actual production instance. We still want to troubleshoot this further to better understand what went wrong on the fresh kubespray-deployed clusters that had the same issue. I can get full stack traces off that reproduction to try and identify where the issue was.
I put working in quotes because although we are using the same container image for the Jupyter user and the Dask workers, we still see mismatched version errors that throw exceptions when running code. I see mentions in the docs of specifying a conda environment to load when launching the cluster, but those aren’t working to align versions. When following the “Exposing Cluster Options” page of the Dask Gateway 2024.1.0 documentation I don’t see any conda environment options, and specifying one leads to no change in versions. Our users can also install their own conda environments for notebooks, so I don’t understand how those would be kept in sync with the Dask Gateway workers either.
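To see exactly which packages differ, it may help to compare what the notebook kernel reports against what the scheduler and workers report. The sketch below assumes the nested dictionary returned by `client.get_versions()` in distributed, with `client`, `scheduler` and `workers` keys that each carry a `packages` mapping; that layout is an assumption here and is worth confirming on your installed version. `find_mismatches` is a hypothetical helper name.

```python
def find_mismatches(versions: dict) -> dict:
    """Return {package: {role: version}} for packages whose versions differ.

    `versions` is assumed to look like the result of client.get_versions():
    {"client": {"packages": {...}}, "scheduler": {"packages": {...}},
     "workers": {address: {"packages": {...}}, ...}}
    """
    by_role = {
        "client": versions["client"]["packages"],
        "scheduler": versions["scheduler"]["packages"],
    }
    for addr, info in versions.get("workers", {}).items():
        by_role[f"worker {addr}"] = info["packages"]

    # Collect every package name reported by any role.
    packages = set()
    for pkgs in by_role.values():
        packages.update(pkgs)

    mismatches = {}
    for name in sorted(packages):
        seen = {role: pkgs.get(name) for role, pkgs in by_role.items()}
        # More than one distinct value means at least one role disagrees.
        if len(set(seen.values())) > 1:
            mismatches[name] = seen
    return mismatches


# In a notebook with a connected client (not run here):
#   find_mismatches(client.get_versions())
```

If the `client` entry is the odd one out, that would be consistent with the notebook kernel running from a user-installed conda environment rather than from the image's default environment.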
In this case, are you using a customized values.yaml file?
This is just an example, and it might not be applicable to every ClusterConfig. For Kubernetes, you would only be able to specify options described in KubeClusterConfig, for example the Docker image used. But this will only override the default configuration you’ve put in your values.yaml file.
And for the other part, I still have no idea what would have caused the problem on your specific Kubernetes version and install.
cc @jacobtomlinson @consideRatio.
I’m not confident about what this is, but a key thing to rule out is that it’s network policies not being enforced well. To rule that out as a starting point, run
kubectl get netpol -A and see what’s around, then try clearing everything you can easily re-create.
As long as no network policy resource targets the pods involved, then they aren’t limited. At that point in time, see if things work and report back.
The network policy things involved should be egress for jupyterhub user pods, and ingress for dask-gateway’s dask-gateway-server pod (named api pod).
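As a rough aid for reading that output, the following sketch lists which NetworkPolicy resources select a given pod, based on the rule that a policy applies to pods in its own namespace whose labels match its `podSelector`, and that an empty selector matches every pod in the namespace. It assumes the policies were exported with `kubectl get netpol -A -o json` and parsed into a list of dicts. `policies_selecting` is a hypothetical helper, and `matchExpressions` are not evaluated, so policies using them are reported conservatively and need a manual check.

```python
def policies_selecting(policies: list, namespace: str, pod_labels: dict) -> list:
    """Return names of NetworkPolicies whose podSelector may select the pod."""
    hits = []
    for policy in policies:
        meta = policy.get("metadata", {})
        # A NetworkPolicy only applies to pods in its own namespace.
        if meta.get("namespace") != namespace:
            continue
        selector = policy.get("spec", {}).get("podSelector") or {}
        wanted = selector.get("matchLabels") or {}
        # An empty matchLabels matches all pods; otherwise every label must match.
        # matchExpressions are ignored here, so such policies are kept as
        # possible matches rather than ruled out.
        if all(pod_labels.get(key) == value for key, value in wanted.items()):
            hits.append(meta.get("name"))
    return hits


# Example use (file name and labels are placeholders):
#   import json
#   items = json.load(open("netpol.json"))["items"]
#   policies_selecting(items, "dhub", {"component": "singleuser-server"})
```

An empty result for both the user pod and the gateway pods would suggest that no policy targets them, which matches the "not limited" case described above.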
Also, for reference, the GitHub repository 2i2c-org/infrastructure (Infrastructure for configuring and deploying our community JupyterHubs) provides a jupyterhub + dask-gateway installation that doesn’t use the daskhub Helm chart, which could be useful as a reference for something known to work.
I’ll open another topic for this issue since it’s different from the original. Thanks for the response.
I’ll update when we are able to run the network policy commands on our reproduction cluster. We built our original cluster by drawing heavily on the 2i2c configurations. They have a hub deployed for us that we have used as a comparison for our on-premise installation.