We’ve deployed a Dask cluster on top of k8s (5 nodes, each with 200 GB of RAM and 50 vCPUs), using an example from this link. When a single developer is using the cluster, everything works perfectly. However, I’m thinking about a team of 3-5 people who will need to run jobs on the cluster simultaneously. My concern is how to efficiently manage multiple users submitting jobs at the same time. For instance, if Developer A submits a job that uses about 60% of the available RAM, and then Developer B submits a job that requires 45% of the RAM, the two jobs together need more memory than the cluster has, which leads to resource contention.
Is there a way to implement a queue or similar mechanism to check resource availability before submitting a job to the scheduler? If resources are insufficient, the job would wait in the queue until there’s enough capacity. Essentially, I’m looking for the best approach (the most Dasky approach, if I can put it that way) to handle day-to-day development using Dask on k8s. Any ideas or feedback would be greatly appreciated. Thanks!
Thanks for the feedback, @Hvuj. Just to clarify, are you suggesting that by using futures or async with a static cluster that everyone connects to, we can let team members share the same cluster and efficiently manage the memory for whatever they submit to it?
Yes, it's possible. We did it, but it created too much trouble.
It's easier and cheaper to create ephemeral clusters with fewer resources and/or dynamic allocation of resources.
As @Hvuj says, the most Dasky approach here is ephemeral clusters, one for each user. Resource priority and sharing would then have to be handled by the resource orchestration system, which is Kubernetes here. You might want to set a resource limit for every user, or use an autoscaling approach. Is this Kubernetes cluster on premises?
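To make the "limit every user" option a bit more concrete, here is a minimal sketch that splits the capacity described in the original post (5 nodes, 200 GB RAM and 50 vCPUs each) evenly across users and builds a Kubernetes `ResourceQuota` manifest for each user's namespace. The even split, the 10% headroom, the one-namespace-per-user layout, and the names `per_user_budget`, `resource_quota`, `dask-quota`, and `dask-dev-a` are all assumptions for illustration, not something from the thread.

```python
import json

# Cluster totals taken from the original post.
NODES = 5
RAM_GB_PER_NODE = 200
VCPUS_PER_NODE = 50


def per_user_budget(n_users, headroom_pct=10):
    """Split capacity evenly among users, keeping headroom_pct percent unallocated.

    Integer arithmetic is used so the result is an exact whole number.
    The even split is an assumption; a real team may want uneven shares.
    """
    total_ram_gb = NODES * RAM_GB_PER_NODE
    total_cpu = NODES * VCPUS_PER_NODE
    usable_pct = 100 - headroom_pct
    return {
        "memory_gb": total_ram_gb * usable_pct // (100 * n_users),
        "cpu": total_cpu * usable_pct // (100 * n_users),
    }


def resource_quota(namespace, budget, name="dask-quota"):
    """Build a ResourceQuota manifest (as a dict) capping one user's namespace.

    The namespace and quota name are hypothetical placeholders.
    """
    cpu = str(budget["cpu"])
    memory = f"{budget['memory_gb']}G"
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
            }
        },
    }


# Example: a team of 5, one namespace per developer (hypothetical name).
manifest = resource_quota("dask-dev-a", per_user_budget(5))
print(json.dumps(manifest, indent=2))
```

The printed JSON can be applied with `kubectl apply -f`. With a quota like this in place, the scheduler and worker pods of each ephemeral cluster need explicit CPU and memory requests and limits, otherwise Kubernetes may reject them. A cluster that asks for more than the user's share is then held back by Kubernetes instead of starving a teammate's job.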