GPU memory within container

Hi dask team,

I have a question that I imagine has an answer, although I can’t seem to find it anywhere. I’m running a Dask cluster in GKE with nodes that have NVIDIA GPUs attached to them. I am successfully using the GPUs in a distributed way by pushing PyTorch models and datasets to the CUDA devices, but I am struggling to monitor GPU memory usage.

I see charts in the diagnostic dashboard labelled GPU utilization and GPU memory, but they are completely blank for me, even when dask-worker containers are running on those nodes.

My question is: where does Dask look for GPU utilization metrics? My guess is that Dask inside the container doesn’t have sufficient access to the GPU to report on it, but without knowing where it looks I’m not sure where to start debugging.

Thank you in advance for your help.

Hi @secrettoad,

I did a quick search of the code, and Dask uses pynvml to collect GPU utilization metrics. You can find the relevant code here: https://github.com/dask/distributed/blob/405c011919bc7176bef8451be02578ca15931110/distributed/worker.py#L3327.

Then the Dashboard just queries these metrics: https://github.com/dask/distributed/blob/405c011919bc7176bef8451be02578ca15931110/distributed/dashboard/components/nvml.py#L131.
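A quick way to debug is to check whether pynvml itself can see the GPU from inside a worker container. The sketch below is just an illustration (it assumes pynvml is installed and uses device index 0 as an example); if `nvmlInit()` fails here, the worker won’t be able to report GPU metrics either:

```python
# Minimal check that pynvml can see the GPU from inside the container.
# Assumes the pynvml package is installed; device index 0 is just an example.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # total/used/free in bytes
util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # .gpu and .memory in percent

print(f"GPU 0: {mem.used / 1e9:.2f} GB used of {mem.total / 1e9:.2f} GB, "
      f"utilization {util.gpu}%")

pynvml.nvmlShutdown()
```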

Hope that helps.


Make sure you are launching your workers with dask-cuda-worker and have the dask-cuda package installed in the worker containers.
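For a quick sanity check, something like the following (a rough sketch, assuming dask-cuda and pynvml are installed on a machine with a GPU) starts a local CUDA-aware cluster so you can confirm the GPU charts populate on the dashboard before moving to GKE:

```python
# Rough sketch: start a local CUDA-aware cluster to verify that the GPU
# utilization/memory charts appear on the dashboard.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()   # one worker per visible GPU
client = Client(cluster)
print(client.dashboard_link)   # open this link and check the GPU charts
```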

@jacobtomlinson thank you! I did not realize that was necessary.