Daskhub Helm chart leads to client timeouts

NickCote · January 8, 2024, 11:12pm

Hello,
We’ve been battling these timeout issues for quite some time and could use some guidance. In our troubleshooting efforts we’ve scaled back to a fresh Kubernetes install with no ingress controllers, just a bare vanilla install. We deploy DaskHub with the following command:

helm upgrade --wait --install --render-subchart-notes dhub dask/daskhub --create-namespace --namespace=dhub

Everything spins up. We login, can create a cluster, can view the cluster link, but when we try to actually connect the client it times out.

The Python code we ran:

from dask_gateway import GatewayCluster

cluster = GatewayCluster()  # connect to Gateway

cluster.adapt(minimum=2, maximum=10)  # scale cluster

client = cluster.get_client()  # connect Client to Cluster

client

There’s a lot in the error from get_client(), I can post the whole thing if it helps, but the final exception is:

OSError: Timed out trying to connect to gateway://traefik-dhub-dask-gateway.dhub:80/dhub.934c8cce03d747109e310baf0671e751 after 30 s
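A minimal sketch, not part of the original post, for checking from inside the Jupyter user pod whether the gateway address in the error is reachable at the TCP level at all. It uses only the Python standard library; the function name `probe` is hypothetical. The idea is that a silently dropped connection attempt usually shows up as a timeout, while a reachable host with nothing listening usually shows up as a refusal.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a plain TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within `timeout` seconds.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing accepts connections on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

With the host and port from the error above, this would be called as `probe("traefik-dhub-dask-gateway.dhub", 80)` in a notebook cell on the user pod.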

At one point we had this working on our production JupyterHub, but we hit these client timeout issues a while back and have not been able to fix them. We’ve made sure our Dask versions are the same across the board, ensured there are no network policies blocking traffic, etc.

Why isn’t a plain vanilla DaskHub installation working on a fresh new Kubernetes cluster? Why are we timing out, and how do we fix it across installations?

guillaumeeb · January 10, 2024, 3:16pm

Hi @NickCote, welcome to Dask Discourse forum!

It’s hard to tell what could be the problem… I just did a DaskHub deployment on GKE yesterday and everything worked fine with the latest version.

Which DaskHub version are you using? Which Kubernetes version, and what type of Kubernetes (on premise)?
You don’t use any values.yaml file to customize your deployment?
Are you able to display the Scheduler Dashboard?
Do you see Worker pods being created?
Did you check Dask gateway pods logs?

It’s weird you are able to create a cluster but not connect a Client to it…

NickCote · January 10, 2024, 4:40pm

We are using the latest DaskHub Helm Chart, 2024.1.0. We had tried with the previous release as well and hit the same issue.

Kubernetes version is 1.28.5 and it was deployed with kubespray on top of Alma Linux on prem.

For this test we are not using any values.yaml file. We have another deployment where this issue is happening that does use one, but for the reproduction we scaled back to the basics. This test was run on new clean hardware with a clean K8s install.

We are able to view the Scheduler dashboard when displaying just the cluster information. We see the worker pods come up as well. The gateway API and Traefik logs don’t provide any information on the timeout. We see requests come in to create the cluster, which it does, but when using client there’s no information to track where this is breaking down. If we close the cluster we see that request come in as well. The Jupyter User pod logs only include the Python errors. We have not been able to find any information in the different pod logs where this traffic is being dropped or denied.

guillaumeeb · January 12, 2024, 4:36pm

I have no idea what could cause this… Could you give us the full stack trace? I’m not really sure it will help a lot, though…

NickCote · January 25, 2024, 6:18pm

Thanks @guillaumeeb,
I’ve been looking to try and find the best places to troubleshoot this further. We have a production Daskhub system deployed that was working fine, and then hit these timeouts for a few months. We then went down the route of recreating on a new bare metal cluster to rule out variables and that had the same timeout errors.
I was able to deploy on a minikube cluster and had everything work. We also went ahead and upgraded the k8s version on our production cluster, and after that everything is now “working” on our actual production instance. We still want to troubleshoot this further to better understand what was happening on the fresh kubespray-deployed clusters that had the same issue. I can get full stack traces off that reproduction to try to identify where the issue was.

I put “working” in quotes because, although we are using the same container image for the Jupyter user and Dask workers, we still see mismatched version errors that throw exceptions when running code. I see mentions in the docs of specifying a conda environment to load when launching the cluster, but those aren’t working to align versions. When following the “Exposing Cluster Options” page of the Dask Gateway 2024.1.0 documentation, I don’t see any conda environment options, and specifying one leads to no change in versions. Our users can also install their own conda environments for notebooks, so I don’t understand how that sync would happen with Dask Gateway workers either.
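A minimal sketch, not from the original post, for pinning down which packages actually differ between the notebook, the scheduler, and the workers. It assumes the nested dictionary layout returned by `Client.get_versions()` in `distributed` (top-level `client`, `scheduler`, and `workers` entries, each carrying a `packages` mapping); the helper name `find_version_mismatches` is hypothetical.

```python
def find_version_mismatches(versions: dict) -> dict:
    """Return {package: {role: version}} for packages whose versions differ."""
    seen: dict = {}

    def record(role: str, info: dict) -> None:
        # Collect every reported package version under the role that reported it.
        for package, version in (info.get("packages") or {}).items():
            seen.setdefault(package, {})[role] = version

    record("client", versions.get("client") or {})
    record("scheduler", versions.get("scheduler") or {})
    for address, info in (versions.get("workers") or {}).items():
        record(f"worker {address}", info)

    # Keep only packages where at least two roles disagree.
    return {
        package: roles
        for package, roles in seen.items()
        if len(set(roles.values())) > 1
    }
```

With a connected client this would be called as `find_version_mismatches(client.get_versions())`. The `client` entry reflects whatever environment the notebook kernel is running in, so a user-installed conda environment can differ from the worker image even when the user pod and the workers share a container image.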

guillaumeeb · January 26, 2024, 2:19pm

In this case, you are using a customized values.yaml file?

This is just an example, and it might not be applicable to every ClusterConfig. For Kubernetes, you would only be able to specify the options described in KubeClusterConfig, for example the Docker image used. But this will only override the default configuration you’ve put in your values.yaml file.

And for the other part, I still have no idea what would have caused the problem on your specific Kubernetes version and install.

cc @jacobtomlinson @consideRatio.

consideRatio · January 26, 2024, 9:17pm

I’m not confident about what this is, but a key thing to rule out is that it’s network policies not being enforced well. To rule that out as a starting point, run kubectl get netpol -A and see what’s around, then try clearing everything you can easily re-create.

As long as no network policy resource targets the pods involved, then they aren’t limited. At that point in time, see if things work and report back.

The network policy things involved should be egress for jupyterhub user pods, and ingress for dask-gateway’s dask-gateway-server pod (named api pod).
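A minimal sketch, not from the original reply, for working out which NetworkPolicy resources target a given pod. It assumes the policies were exported with `kubectl get netpol -A -o json` and loaded with `json.load`, and that the pod's labels were read from `kubectl get pod <name> --show-labels`; the function names are hypothetical.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Evaluate a Kubernetes label selector against a pod's labels."""
    for key, value in (selector.get("matchLabels") or {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions") or []:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values") or []
        if op == "In" and labels.get(key) not in values:
            return False
        if op == "NotIn" and labels.get(key) in values:
            return False
        if op == "Exists" and key not in labels:
            return False
        if op == "DoesNotExist" and key in labels:
            return False
    # An empty selector ({}) selects every pod in the policy's namespace.
    return True


def policies_selecting(pod_namespace: str, pod_labels: dict, netpol_list: dict) -> list:
    """List (name, policyTypes) for each policy that targets the pod."""
    hits = []
    for item in netpol_list.get("items", []):
        meta, spec = item["metadata"], item["spec"]
        # A NetworkPolicy only applies to pods in its own namespace.
        if meta.get("namespace") != pod_namespace:
            continue
        if selector_matches(spec.get("podSelector") or {}, pod_labels):
            hits.append((meta["name"], spec.get("policyTypes", [])))
    return hits
```

An empty result for both the user pod and the gateway pods would mean no policy targets them, which is the unrestricted state described above.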

Also, for reference, the 2i2c-org/infrastructure repository on GitHub (infrastructure for configuring and deploying 2i2c’s community JupyterHubs) provides a jupyterhub+dask-gateway installation that doesn’t use the daskhub helm chart. It could be useful as a reference to something known to work.

NickCote · January 29, 2024, 5:54pm

I’ll open another topic for this issue since it’s different from the original. Thanks for the response.

I’ll update when we are able to run the network policy commands on our recreate cluster. We started our original cluster by leveraging the 2i2c configurations a good amount. They have a hub deployed for us that we have used to compare our on-premise installation.

NickCote · February 27, 2024, 6:10pm

A coworker deployed the daskhub chart on a fresh k8s cluster with Cilium as the CNI and discovered that the JupyterHub chart includes default network policies that block communication with Gateway. He opened an issue with the information he found while troubleshooting this further.

github.com/dask/helm-chart

Network policies break daskhub

opened 08:15PM - 16 Feb 24 UTC

kcote-ncar

**Describe the issue**:

It appears the default network policies from the jupyterhub helm chart breaks communication with dask-gateway and the kube-apiserver.

I deployed daskhub with default values onto a vanilla K8s cluster with a CNI that supports network policies (cilium).

`helm upgrade --install --create-namespace --namespace jhub01 jhub01 dask/daskhub`

With this deployment, the jupyterhub pod will not spawn and I receive this output:

![image](https://github.com/dask/helm-chart/assets/43788363/4a484716-b02f-4f58-a7d5-293f1922fcac)

Using hubble, I am able to see the packets are being dropped via network policy:

`hubble observe -n jhub01 -t drop -f`

> Feb 16 18:05:30.839: jhub01/hub-fc455bdb8-2n7ct:34144 (ID:133898) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 18:05:31.862: jhub01/hub-fc455bdb8-2n7ct:34144 (ID:133898) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 18:05:33.910: jhub01/hub-fc455bdb8-2n7ct:34144 (ID:133898) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 18:05:35.043: jhub01/hub-fc455bdb8-2n7ct:38918 (ID:133898) <> XXX.XXX.XXX.148:6443 (kube-apiserver) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 18:05:36.086: jhub01/hub-fc455bdb8-2n7ct:38918 (ID:133898) <> XXX.XXX.XXX.148:6443 (kube-apiserver) Policy denied DROPPED (TCP Flags: SYN)

If I allow access to the kube-apiserver (reference ticket below), the pod will then spawn but I still get drops for dask-gateway communication:

> Feb 16 19:40:58.002: jhub01/jupyter-test:34604 (ID:146419) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 19:41:00.498: jhub01/jupyter-test:53158 (ID:146419) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 19:41:06.130: jhub01/jupyter-test:34604 (ID:146419) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 19:41:07.936: jhub01/hub-5fd4dbdb78-gmnvw:58384 (ID:133898) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 19:41:08.950: jhub01/hub-5fd4dbdb78-gmnvw:58384 (ID:133898) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 19:41:10.998: jhub01/hub-5fd4dbdb78-gmnvw:58384 (ID:133898) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 19:41:14.914: jhub01/jupyter-test:43114 (ID:146419) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)
> Feb 16 19:41:15.030: jhub01/hub-5fd4dbdb78-gmnvw:58384 (ID:133898) <> jhub01/traefik-jhub01-dask-gateway-7665b69c66-hzwrj:8000 (ID:170445) Policy denied DROPPED (TCP Flags: SYN)

Here is the list of network policies defined for the whole cluster:

`kubectl get networkpolicies.networking.k8s.io -A`

> NAMESPACE NAME POD-SELECTOR AGE
> jhub01 hub app=jupyterhub,component=hub,release=jhub01 22h
> jhub01 proxy app=jupyterhub,component=proxy,release=jhub01 22h
> jhub01 singleuser app=jupyterhub,component=singleuser-server,release=jhub01 22h

Everything works when I deploy this network policy into the namespace:

````
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-all-ingress-egress
spec:
  podSelector: {}
  egress:
  - {}
  ingress:
  - {}
  policyTypes:
  - Egress
  - Ingress
````

**Anything else we need to know?**:

Bare Metal
- K8s Server Version: v1.29.1
- CRI-O Version: v1.29.1
- Cilium Version: v1.15.1

This issue is related and is why we are seeing drops for the kube-apiserver:
- https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/3202

If I allow access to the kube-apiserver I then hit this issue:
- https://github.com/dask/helm-chart/issues/431

I think that the daskhub chart should deploy network policies to allow the jupyterhub pod to communicate with dask-gateway. Or perhaps something about the correct network policies should be documented since the default values don't allow dask-gateway communication?

**Environment**:

- Dask version: daskhub-2024.1.1
- Python version:
- Operating System: AlmaLinux 9
- Install method (conda, pip, source): helm
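A minimal sketch, not part of the linked issue, for condensing `hubble observe` drop output like the lines quoted in the issue into a count per source and destination. The regular expression is written against the line format shown in the issue only and may need adjusting for other Hubble output formats; the names `DROP` and `summarize_drops` are hypothetical.

```python
import re
from collections import Counter

# Matches "<source>:<port> (<identity>) <> <destination>:<port> (<identity>) Policy denied DROPPED".
DROP = re.compile(
    r"(?P<src>\S+):(?P<sport>\d+) \([^)]*\) <> "
    r"(?P<dst>\S+):(?P<dport>\d+) \([^)]*\) Policy denied DROPPED"
)


def summarize_drops(lines) -> Counter:
    """Count policy-denied drops per (source, destination:port) pair."""
    counts: Counter = Counter()
    for line in lines:
        match = DROP.search(line)
        if match:
            # The source port is ephemeral, so it is left out of the key.
            key = (match["src"], f'{match["dst"]}:{match["dport"]}')
            counts[key] += 1
    return counts
```

Saving the drop output to a file and passing the open file object to `summarize_drops` shows at a glance which pod pairs are being denied, for example user pods to the Traefik pod on port 8000.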
