Just curious, has anyone been able to get a proper dask gateway service to run on EKS with Karpenter-powered autoscaling? I have been trying for almost a year, and while it does work at smaller scale, once we try to get up to 300+ workers on the cluster we start to see strange networking issues: TCP handshake failures, workers and schedulers getting stuck and eventually failing out after hours, and sudden restarts. We tried a whole array of fixes, including increasing CoreDNS replicas and playing with Karpenter settings.
Ultimately we decided to roll back to Cluster Autoscaler, and our issues mostly faded away; we can comfortably run 1000+ workers on our cluster across 4 different dask gateways in different namespaces. I’m trying to figure out if there’s a Dask problem with using Karpenter or a Karpenter problem with using Dask. Does anyone have similar experiences?
Hi @vish727, welcome to Dask Discourse,
I have to admit I didn’t know about Karpenter before this topic, so I probably won’t be much help…
I’m not sure how it works, but Dask Workers and especially Schedulers hold state: they cannot be moved between nodes, or stopped and restarted. Might that be a reason? Or maybe Karpenter has some special way of handling network communication?
cc @jacobtomlinson
Hey @guillaumeeb, thanks for the welcome. I’m a longtime lurker, first-time poster. Thanks for taking the time to reply. Karpenter does have some optimization functionality where it can consolidate pods from underutilized nodes, but I had configured this to only occur when nodes were empty, i.e. < 10% utilization, not during our peak periods.
The other thing is spot disruptions of our workers, but this is nothing new and does not happen frequently enough to explain these issues.
Cluster Autoscaler (CAS) and Karpenter communicate with the AWS API differently. Karpenter communicates directly (and very frequently) with the EC2 Fleet API, and its network traffic seems to be much greater than that of CAS, which uses Auto Scaling groups.
I’m trying to work with AWS on getting deeper into this, as we’d like to take advantage of Karpenter for our prod EKS dask gateway cluster. I just wanted to see if anyone else has run into similar problems or even attempted what we are attempting.
TBH I’ve never heard of any other shop running dask-gateway at scale in production like we do, and I’m curious to see if anyone on here has any experience. We have 4 dask gateways on the cluster and dhub installed. The dask gateways wait for requests from our batch jobs, and at times one gateway can have 2000+ workers running. This setup has worked fine for us with Cluster Autoscaler.
Wow, this looks like quite a good story! Would you be interested in sharing your use case in a blog post or something? cc @scharlottej13.
Sure, I think that’s something that could be interesting. Let me check internally to see what can be shared, but I’d love to connect for sure. Feel free to have someone DM me.