Dask Controller (Dask Gateway) Sometimes Hangs

Hello All! We’ve been running into a problem that shows up randomly with our Dask Gateway cluster. Sometimes, out of nowhere, the Dask Gateway controller stops acting on requests to create a cluster. The API says it has received the request, but then nothing happens on the controller. When I restart the controller pod, the cluster that was waiting to start springs to life. This last time, I did see a “Controller timeout” error on the controller pod from about a week or two prior to trying to run the cluster that hung. Is something being done incorrectly here that would cause the controller to hang like that?

We are currently using 2024.1.0 for the controller’s image. We are running within an AKS environment on k8s 1.28.

Also, this problem is intermittent. It was gone for a few months, then started back up without any obvious trigger. Any ideas?

Hi @jbeeman, welcome to Dask Discourse!

Unfortunately, nothing jumps to mind. Dask Gateway is a bit short on support for the time being, and I think this post would be better raised as an issue on the project. cc @jacobtomlinson.

Are you really tied to Dask Gateway, or could you use Dask Kubernetes?

Hi @guillaumeeb thank you for your quick response!

Unfortunately we are dependent on Dask Gateway because we have a hard requirement to place a security layer in front of dask. I’ve actually updated my own offshoot of the jupyterhub authenticator to meet some of the specific security requirements my company mandates.

If there are other solutions out there for authentication with dask, I’m open to other approaches!

We just ran into this issue again today. The Kube controller seems to be stuck trying to remove an expired cluster record, looping constantly. I’ll update this post if I find anything else. I’ll also be working to turn on the DEBUG log level so I can get more info the next time it hangs for us.

Well, I’m not sure. Do you have any advice @jacobtomlinson?

Dask supports client side certificates for authentication, but that’s it.
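
For reference, here is a minimal sketch of what the client-side certificate setup could look like using the `Security` class from `distributed`. The file names, the helper function `make_client_security`, and the scheduler address are all hypothetical placeholders, and the scheduler and workers would need matching TLS settings on their side, so treat this as a starting point rather than a complete setup.

```python
from distributed.security import Security


def make_client_security(ca_file: str, client_cert: str, client_key: str) -> Security:
    """Build a Security object that presents a client certificate over TLS.

    Hypothetical helper for illustration; it only wraps the Security constructor.
    """
    return Security(
        tls_ca_file=ca_file,          # CA used to verify the scheduler's certificate
        tls_client_cert=client_cert,  # certificate the client presents
        tls_client_key=client_key,    # private key matching the client certificate
        require_encryption=True,      # refuse unencrypted connections
    )


# Hypothetical file names, replace with your own certificate paths.
security = make_client_security("ca.pem", "client-cert.pem", "client-key.pem")

# Connecting would then look like this (hypothetical address, not run here):
# from distributed import Client
# client = Client("tls://scheduler.example.internal:8786", security=security)
```

Note that this only authenticates at the TLS level. It does not give you per-user authorization the way a Gateway authenticator does, so it may or may not be enough for your security requirements.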