Dask Cluster on k8s - Managing Multiple Users Submitting Jobs Concurrently

Hi everyone.

We’ve deployed a Dask cluster on top of k8s (5 nodes, each with 200 GB of RAM and 50 vCPUs), following an example from this link. When a single developer is using the cluster, everything works perfectly. However, I’m thinking about a team of 3-5 people who will need to run jobs on the cluster simultaneously. My concern is how to efficiently manage multiple users submitting jobs at the same time. For instance, if Developer A submits a job that uses about 60% of the available RAM, and then Developer B submits a job that may require 45% of the RAM, this could lead to resource contention.

Is there a way to implement a queue or similar mechanism to check resource availability before submitting a job to the scheduler? If resources are insufficient, the job would wait in the queue until there’s enough capacity. Essentially, I’m looking for the best approach (the most Dasky approach if I can say like that) to handle day-to-day development using Dask on k8s. Any ideas or feedback would be greatly appreciated. Thanks!

We used to do this and switched to ephemeral clusters due to resource overload issues and making the scheduler and workers work too hard.

I don’t recommend this: in general it will require more work on the k8s side than on the Dask side. Yes, it can be done, but it isn’t worth the trouble.

You would need to implement the logic yourself using futures or async/await.

https://distributed.dask.org/en/latest/asynchronous.html
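For reference, a minimal sketch of what that could look like, following the asynchronous API from the page above. For self-containment the client here starts a local in-process cluster; against a shared cluster you would pass the scheduler address instead (the address below is a placeholder, and `submit_when_ready` is just an illustrative name):

```python
import asyncio
from dask.distributed import Client

async def submit_when_ready(func, *args):
    # With no address, Client starts a local in-process cluster; on a
    # shared deployment you would connect to the scheduler instead,
    # e.g. Client("tcp://<scheduler>:8786", asynchronous=True).
    async with Client(asynchronous=True, processes=False) as client:
        future = client.submit(func, *args)
        # Awaiting the future yields control while the job runs, so one
        # process can coordinate many users' submissions without
        # blocking a thread per job.
        return await future

result = asyncio.run(submit_when_ready(lambda x: x + 1, 41))
print(result)  # 42
```

Any queueing or "wait until enough RAM is free" policy would have to be layered on top of this by your own code, which is where the extra work mentioned above comes in.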


Thanks for the feedback, @Hvuj . Just to clarify: are you suggesting that, by implementing futures or async on a static cluster that everyone can connect to, we can address the issues of team members working on the same cluster and efficiently manage the memory for whatever they submit?

Yes, it’s possible. We did it, but it created too much trouble.
It’s easier and cheaper to create ephemeral clusters with fewer resources and/or dynamic allocation of resources.


As @Hvuj is saying, the most Dasky approach here is ephemeral clusters, one for each user. The resources priority or sharing would have to be handled by the resource orchestration system, Kubernetes here. You might want to limit every user, or to have an autoscaling approach. Do you have this K8S system on premise?
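The ephemeral pattern could be sketched like this. `LocalCluster` is used as a stand-in so the example runs anywhere; on Kubernetes you would swap in `KubeCluster` from the dask-kubernetes operator, which follows the same lifecycle. The worker counts and memory limit are illustrative:

```python
from dask.distributed import LocalCluster, Client

def run_job(func, *args, n_workers=2):
    # Each user (or each job) spins up a short-lived cluster, computes,
    # and tears everything down on exit, so there is no contention on a
    # shared scheduler. On k8s, replace LocalCluster with
    # dask_kubernetes.operator.KubeCluster.
    with LocalCluster(n_workers=n_workers, threads_per_worker=1,
                      memory_limit="1GiB") as cluster:
        with Client(cluster) as client:
            return client.submit(func, *args).result()

total = run_job(sum, range(100))
print(total)  # 4950
```

Using context managers this way guarantees the workers are released even if the job fails, which also helps with cleanup discipline.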


Hi @guillaumeeb

Yes, my k8s system is on-prem, so I have a limited amount of resources.

That will make things harder, but I’ll try to aim for ephemeral clusters anyway. That also means users have to clean things up properly…


Thank you guys for the comments and suggestions! Ephemeral clusters will do the job for now.

Please do not hesitate to get back to us and describe the K8S configuration you put in place for that!