Dask Controller (Dask Gateway) Sometimes Hanges

Hello All! We’ve been running into a problem that shows up randomly with our dask gateway cluster. Sometimes out of nowhere, the dask gateway controller will stop receiving requests to create the cluster. The API says it’s received a request to create the cluster, but then nothing happens on the controller. When I restart the controller pod, the cluster that was waiting to start now springs to life. This last time, I did see a “Controller timeout” error on the controller pod from about a week or two prior from trying to run the cluster that hung. Is there something being done incorrectly here that would cause the controller to just hang like that?

We are currently using 2024.1.0 for the controller’s image. We are running within an AKS environment on k8s 1.28.

Also this problem will happen randomly. It was gone for a few months, then randomly started back up. Any ideas?

Hi @jbeeman, welcome to Dask Discourse!

Unfortunately, no idea jumps out of my mind. Dask Gateway is lacking support a bit for the time being, but I think this post is more an issue for the project. cc @jacobtomlinson.

Are you really tied to Dask Gateway, or could you use Dask Kubernetes?

Hi @guillaumeeb thank you for your quick response!

Unfortunately we are dependent on Dask Gateway because we have a hard requirement to place a security layer in front of dask. I’ve actually updated my own offshoot of the jupyterhub authenticator to meet some of the specific security requirements my company mandates.

If there are other solutions out there for authentication with dask, I’m open to other approaches!

We just ran into this issue again today. It seems to be stuck in Kube Controller trying to remove an expired cluster record, just looping constantly. I’ll update this forum post if I find anything else out. I’ll be working to turn on a DEBUG log level to see if I can get more info next time it hangs for us.

Well, I’m not sure. Do you have any advice @jacobtomlinson?

Dask supports client side certificates for authentication, but that’s it.

All good.

I think I figured out the culprit. When updating from dask gateway version 6.2022 up to the 1.2024 version, you need to recreate the Dask Gateway CRD. After recreating the CRD, we have not seen any crashes since (waited a month to respond to hopefully fully rule out an incorrect estimate)

Thank you for your help with this issue!

1 Like