Hi team,
Another question from me. I hope you don’t mind. I am trying to create a Dask cluster via Dask Gateway on Kubernetes (GKE) that has 4 GPUs per worker/pod/node, and I am seeing some very strange behavior: the workers are unexpectedly killed and then recreated, over and over, before I have submitted a single future to them. I can deploy the same cluster with only one GPU per worker, onto nodes that only have one GPU, via Dask Gateway within the same GKE cluster, and it works just fine. My hunch is that it is trying to create one process per GPU and that this is somehow leading to the crash. I am not sure how to specify multiple GPUs per process, however.
I cannot for the life of me find anything in the docs about this issue or about how to specify multiple GPUs per process. Any help here is appreciated.
Edit: some more context. This message shows up over and over again in the gateway server logs: Cluster default.0fb800507124441f8502914de152bb56 scaled to 3 - creating 1 workers, deleting 1 stopped workers
I can’t figure out why a worker is being ‘stopped’ and then recreated almost instantly, though.
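In case it helps anyone quantify the churn, here is a rough sketch of how the repeated scaling messages could be counted from a saved copy of the gateway server logs. The regular expression is assumed from the single log line quoted above, so the wording may differ between dask-gateway versions, and `summarize_churn` is a made-up helper name, not part of dask-gateway.

```python
import re

# Pattern assumed from the one log line quoted in this post; adjust it if
# your gateway version words the message differently.
SCALE_RE = re.compile(
    r"Cluster (?P<cluster>\S+) scaled to (?P<target>\d+) - "
    r"creating (?P<created>\d+) workers, deleting (?P<deleted>\d+) stopped workers"
)


def summarize_churn(log_lines):
    """Return {cluster_name: (workers_created, workers_deleted)} summed over all matching lines."""
    totals = {}
    for line in log_lines:
        match = SCALE_RE.search(line)
        if match is None:
            continue  # not a scaling message
        created, deleted = totals.get(match["cluster"], (0, 0))
        totals[match["cluster"]] = (
            created + int(match["created"]),
            deleted + int(match["deleted"]),
        )
    return totals


if __name__ == "__main__":
    # Hypothetical input: the quoted line repeated, plus an unrelated line.
    line = (
        "Cluster default.0fb800507124441f8502914de152bb56 scaled to 3 - "
        "creating 1 workers, deleting 1 stopped workers"
    )
    print(summarize_churn([line, line, "some unrelated log line"]))
```

A deleted count that keeps climbing while the target stays at 3 would confirm that the worker pods themselves are dying and being replaced, rather than the cluster being rescaled.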