Multi-GPU dask gateway pods

Hi team,

Another question from me, I hope you don't mind. I am trying to create a Dask cluster via dask-gateway on Kubernetes (GKE) with 4 GPUs per worker/pod/node, and I'm hitting some very strange behavior: the workers are unexpectedly killed and then recreated over and over, without a single future having been submitted to them. Deploying the same cluster with one GPU per worker, onto nodes that have a single GPU, via dask-gateway in the same GKE cluster works just fine. My hunch is that it is trying to create one process per GPU and that somehow leads to the crash, but I can't see how to specify multiple GPUs per process.
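For reference, the client side is roughly the standard dask-gateway flow; the sketch below is approximate (the actual option values for image and GPU count come from our gateway backend config, which I've omitted):

```python
# Rough sketch of how I'm creating the cluster (the real option values for
# image / GPU count are set in our gateway backend config, omitted here).
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()    # backend-defined options
cluster = gateway.new_cluster(options)
cluster.scale(3)                       # matches the "scaled to 3" in the log below
```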

I can't for the life of me find anything in the docs about this issue or about specifying multiple GPUs per process. Any help is appreciated.

Edit - some more context: this shows up over and over again in the gateway server logs:

`Cluster default.0fb800507124441f8502914de152bb56 scaled to 3 - creating 1 workers, deleting 1 stopped workers`

I can't figure out why a worker is being 'stopped' and then recreated instantly, though.

Resolved this by setting the env var `CUDA_VISIBLE_DEVICES=0` in the worker Docker image.
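To illustrate what that workaround does (a quick check inside a worker container, assuming torch is installed in the image): each worker process now only enumerates a single device.

```python
# Quick check of the workaround's effect inside a worker container
# (assumes torch is installed in the image).
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # "0"
print(torch.cuda.device_count())               # 1 -- hence the follow-up below
```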

Although now only one of the four GPUs is visible to torch… any advice on how to keep them visible to torch but invisible to Dask would be appreciated.

Hi @secrettoad,

Could you share the code and configuration you are using to set up the cluster? What kind of Dask worker are you creating?

I recommend checking out dask-cuda, which provides a `dask cuda worker` command and a `dask_cuda.CUDAWorker` class that handle worker startup on GPU machines and avoid problems like this.
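For reference, here is a minimal local sketch of the model dask-cuda uses (not your gateway setup): it starts one worker process per GPU and pins each one to its own device via `CUDA_VISIBLE_DEVICES`.

```python
# Minimal local sketch of dask-cuda's model: one worker process per GPU,
# each pinned to a single device via CUDA_VISIBLE_DEVICES.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()               # detects the GPUs on this node
client = Client(cluster)
print(client.scheduler_info()["workers"])  # expect one worker entry per GPU
```

With dask-gateway, I believe you would point the backend's worker command at `dask cuda worker` rather than baking `CUDA_VISIBLE_DEVICES` into the image, but check the dask-gateway backend configuration docs for your deployment.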

I also don't think multiple GPUs per process is supported. What are you passing as the scaling kwarg on your cluster?