Multi-GPU dask gateway pods

Hi team,

Another question from me, I hope you don't mind. I am trying to create a Dask cluster via dask-gateway on Kubernetes (GKE) with 4 GPUs per worker/pod/node, and I'm hitting some very strange behavior: the workers are unexpectedly killed and then recreated over and over, without a single future having been submitted to them. Deploying the same cluster with one GPU per worker, onto nodes that have a single GPU, via dask-gateway in the same GKE cluster works just fine. My hunch is that it is trying to create one process per GPU and that somehow leads to the crash, but I can't see how to specify multiple GPUs per process.
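For reference, the client side is roughly the standard dask-gateway flow; the sketch below is approximate (the actual option values for image and GPU count come from our gateway backend config, which I've omitted):

```python
# Rough sketch of how I'm creating the cluster (the real option values for
# image / GPU count are set in our gateway backend config, omitted here).
from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()    # backend-defined options
cluster = gateway.new_cluster(options)
cluster.scale(3)                       # matches the "scaled to 3" in the log below
```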

I can't for the life of me find anything in the docs about this issue or about specifying multiple GPUs per process. Any help is appreciated.

Edit - some more context: this shows up over and over again in the gateway server logs:

`Cluster default.0fb800507124441f8502914de152bb56 scaled to 3 - creating 1 workers, deleting 1 stopped workers`

I can't figure out why a worker is being 'stopped' and then recreated instantly, though.

Resolved this by setting the env var `CUDA_VISIBLE_DEVICES=0` in the worker Docker image.
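To illustrate what that workaround does (a quick check inside a worker container, assuming torch is installed in the image): each worker process now only enumerates a single device.

```python
# Quick check of the workaround's effect inside a worker container
# (assumes torch is installed in the image).
import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # "0"
print(torch.cuda.device_count())               # 1 -- hence the follow-up below
```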

Although now only one of the four GPUs is visible to torch… any advice on how to keep them visible to torch but invisible to Dask would be appreciated.

Hi @secrettoad,

Could you share the code and configuration you are using to set up the cluster? What kind of Dask worker are you creating?

I recommend checking out dask-cuda, which provides a `dask cuda worker` command and a `dask_cuda.CUDAWorker` class that handle worker startup on GPU machines and avoid problems like this.
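For reference, here is a minimal local sketch of the model dask-cuda uses (not your gateway setup): it starts one worker process per GPU and pins each one to its own device via `CUDA_VISIBLE_DEVICES`.

```python
# Minimal local sketch of dask-cuda's model: one worker process per GPU,
# each pinned to a single device via CUDA_VISIBLE_DEVICES.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()               # detects the GPUs on this node
client = Client(cluster)
print(client.scheduler_info()["workers"])  # expect one worker entry per GPU
```

With dask-gateway, I believe you would point the backend's worker command at `dask cuda worker` rather than baking `CUDA_VISIBLE_DEVICES` into the image, but check the dask-gateway backend configuration docs for your deployment.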

I also don't think multiple GPUs per process is supported. What are you passing as the scaling kwarg on your cluster?