Very low CPU utilization on SLURMCluster

Hello, I was trying to set up a Dask + SLURM cluster and run some benchmarks. I am attaching the report of my run: Dask Performance Report. Here args.Nodes=8 and args.cores=32, i.e. running on 32*8 cores in total across the cluster. I explicitly want 1 process per core, with each process having 1 thread, and also 1 task per node.

When going through the System tab of the report, I see very low CPU utilization (less than 10% on average). I logged into several nodes where computations were running, and htop showed that all CPUs were fully loaded. Is the CPU report reliable? And if it is reliable, what can I do to increase the CPU utilization?

Hi @ikabadzhov,

Some comments I can make from your report:

In your report, you only have 160 cores. So I guess only 5 of the 8 jobs generated by your scale call made it to the running state before your computation started.

As for wanting 1 process per core with 1 thread each, you got that part perfectly right. This can be seen in the code snippet, where the processes kwarg is equal to cores, and also in the summary of the report, where 160 workers (i.e. separate processes) are identified.
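To make the numbers above concrete, here is a minimal sketch of the layout being discussed, not the original poster's code. The helper names are hypothetical and the `memory` value is a placeholder, since the thread does not give one. It shows how `processes == cores` yields one single-threaded worker per core, why 160 workers corresponds to 5 running jobs, and how `Client.wait_for_workers` could be used to hold the computation until all jobs are running.

```python
def worker_layout(nodes, cores_per_node):
    """Build SLURMCluster keyword arguments and the expected worker count.

    Hypothetical helper for illustration only.
    """
    kwargs = {
        "cores": cores_per_node,      # total threads per SLURM job
        "processes": cores_per_node,  # as many processes as cores -> 1 thread each
        "memory": "64GB",             # placeholder: not stated in the thread
    }
    expected_workers = nodes * cores_per_node
    return kwargs, expected_workers


def jobs_running(observed_workers, cores_per_node):
    """Infer how many jobs are running from the number of workers in the report."""
    return observed_workers // cores_per_node


def start_cluster(nodes, cores_per_node):
    """Start the cluster and block until every expected worker has connected."""
    # Imported here so the helpers above can be used without a SLURM setup.
    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    kwargs, expected = worker_layout(nodes, cores_per_node)
    cluster = SLURMCluster(**kwargs)
    cluster.scale(jobs=nodes)          # one SLURM job per node
    client = Client(cluster)
    client.wait_for_workers(expected)  # avoid starting with only part of the cluster
    return client
```

With the values from this thread, `worker_layout(8, 32)` expects 256 workers, and `jobs_running(160, 32)` gives 5, which matches the guess that only 5 of the 8 jobs were running.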

Whether the CPU report is reliable, I don’t know. I’m not sure what the System tab reflects. But I can tell from your task stream that you definitely have 160 computations active at the same time.

Thanks @guillaumeeb for the response. I was trying to investigate the issue further.

Following the tutorial here, Dask on HPC Introduction - YouTube [18:40], I ssh-ed into my cluster and checked “live” how my workers were processing the jobs. Almost every worker had very high CPU usage most of the time.

However, the performance report that I generated in the end was similar to the one linked here (I fixed the number of workers), and again the CPU utilization in the report was very low.
When I refer to “tabs”: if you open the link, at the top there should be <Summary, Task Stream, System, …>. The System tab shows very low CPU utilization, and this is the figure that I suspect is not reliable.

Right now, I have “confirmation” both from htop and from the live view over ssh that a node is fully busy during computation. I hope I have formulated my suspicion more clearly now.

I’m not sure whether the System tab represents overall CPU utilization across the Dask cluster. If it does, then it obviously looks wrong compared to what you see. But I think it might represent CPU usage on the scheduler side (I should try it myself, but that is not possible for me at the moment).
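One way to cross-check the worker side independently of the report is to sample CPU usage on the workers themselves. This is a minimal sketch with hypothetical helper names, assuming a connected `dask.distributed.Client` and that `psutil` is importable on the workers. It has not been run on the cluster from this thread.

```python
def sample_cpu():
    """Runs on a worker: node-wide CPU percent over a one-second window."""
    import psutil

    return psutil.cpu_percent(interval=1)


def cpu_by_worker(client):
    """Return a dict mapping worker address -> sampled CPU percent."""
    # Client.run executes the function on every worker process, outside the
    # task scheduling system, and collects the results per worker address.
    return client.run(sample_cpu)


def mean_cpu(per_worker):
    """Average the per-worker samples into a single number."""
    return sum(per_worker.values()) / len(per_worker)
```

Calling `mean_cpu(cpu_by_worker(client))` while a computation is running gives a number to compare against the System tab. Note that `psutil.cpu_percent` is node-wide, so workers sharing a node report roughly the same value.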
