How to efficiently monitor GPU usage without a dashboard?

Hello folks,

I’m trying to run a computation on roughly 30 GB of data across 4 clustered GPUs.
Even when I split the data into small chunks of 100 MB, memory usage grows so much that I get allocation errors.
The point is: how can I efficiently profile the GPU memory usage of my process? For further context, I’m using CuPy and Dask Arrays.
If I were using only the CPU and host memory, I could easily use the dask-memusage plugin, but unfortunately it does not work with GPUs.
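
For reference, here is a minimal sketch of roughly what I’m doing (the scheduler address, array shape, and chunk size are illustrative, not my exact code):

```python
import cupy as cp
import dask.array as da
from dask.distributed import Client

client = Client("scheduler-address:8786")  # placeholder for my existing cluster

# ~30 GB of float64 split into ~100 MB chunks, each chunk moved to GPU memory via CuPy
x = da.random.random((62_500, 62_500), chunks=(3_600, 3_600))
x = x.map_blocks(cp.asarray)

# Some reduction over the whole array; GPU memory keeps climbing until allocation fails
result = (x * x).sum().compute()
```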

I’m not using the dashboard because I’m running on a cluster that does not let me open ports externally.

Any thoughts and suggestions are welcome.

Hi @jcfaracco,

Did you go through
https://distributed.dask.org/en/stable/diagnosing-performance.html

or
https://docs.dask.org/en/stable/diagnostics-distributed.html

There might be some useful tools there, like performance_report or MemorySampler for example. I’m not sure how well they work with GPUs.
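
I haven’t tried this with GPUs, but here is a minimal sketch of how the two could be combined (the workload is a stand-in, and note that MemorySampler records memory as reported by the scheduler, which may not capture GPU memory):

```python
import dask.array as da
from dask.distributed import Client, performance_report
from distributed.diagnostics import MemorySampler

client = Client()  # or Client("scheduler-address:8786") for an existing cluster

ms = MemorySampler()
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))  # stand-in workload

# Record a performance report and sample cluster memory while the computation runs
with performance_report(filename="dask-report.html"), ms.sample("example run"):
    x.sum().compute()

ms.to_pandas().to_csv("memory_samples.csv")  # timestamped cluster-wide memory samples
```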

The only other solution I see is using external tooling like nvidia-smi (there might be packages that can record the output of this command).
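
As a rough sketch (the sampling interval, output file, and query fields are arbitrary choices), you could poll nvidia-smi yourself and log the samples to a CSV:

```python
import csv
import subprocess
import time

with open("gpu_memory_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "gpu_index", "memory_used_mib"])
    while True:  # stop with Ctrl+C
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
        now = time.time()
        for line in out.strip().splitlines():
            index, mem_used = [field.strip() for field in line.split(",")]
            writer.writerow([now, index, mem_used])
        f.flush()
        time.sleep(1)  # sample once per second
```

I believe nvidia-smi can also loop and write to a file on its own (the `-l`/`--loop` and `-f`/`--filename` options), if you prefer not to wrap it in Python.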

You could also try without GPUs and see how it goes.

Also, did you try SSH port forwarding?
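
Something like `ssh -L 8787:localhost:8787 user@login-node` (assuming the default dashboard port 8787, a placeholder user/host, and that the scheduler runs on the node you SSH into) would let you open the dashboard at `localhost:8787` on your own machine without exposing any ports externally.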

I wrote a plugin similar to dask-memusage. If anyone is interested: GitHub - discovery-unicamp/dask-memusage-gpus: A thread-based and low-impact GPU memory profiler for Dask.

It is missing documentation, but that is something I will add in the coming weeks.
