Persistent memory profiling/logging

Hi, as part of debugging large Dask image operations run on HPC, I’m looking for a way to do real-time logging of (cluster) memory.

I’ve tried a number of options using the functionality provided by Dask:

  • The Dask dashboard (via SSH tunnelling): this is really nice, but it seems unstable, timing out at some point, and I don’t want to watch it manually over many hours. It doesn’t help that the tunnel needs two jumps.
  • MemorySampler: this logs exactly what I need, but it seems unstable over long runs, failing the entire HPC job. Moreover, if an unexpected OOM occurs, the results are not stored (see the sketch of my usage after this list).
  • Exposing logging by lowering the ‘silence’ level to info: I haven’t found any way to get memory information this way.
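
For reference, this is roughly how I’m using MemorySampler (a minimal sketch, with a toy computation and a local Client standing in for the actual dask-image pipeline and HPC cluster setup):

import dask.array as da
from dask.distributed import Client
from distributed.diagnostics import MemorySampler

client = Client()  # stand-in for the actual HPC cluster setup
ms = MemorySampler()

with ms.sample("image-pipeline"):
    # toy computation standing in for the real dask-image pipeline
    da.random.random((20_000, 20_000), chunks=(2_000, 2_000)).mean().compute()

# Samples only become available once the context manager exits, so nothing
# is written if the job is killed by an OOM before reaching this point.
ms.to_pandas().to_csv("memory_samples.csv")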

So ideally I’m looking for something similar to what MemorySampler does (periodically sampling memory use), but writing the output in real time, e.g. appending entries to a text file periodically, and ideally without having to create a separate thread to query this explicitly.

Does this functionality exist? What would be the best way to do this?
I’ve gone through a lot of documentation and blog posts, also specifically on HPC, but was not able to find this, so any help is much appreciated.

Hi @folterj, welcome to Dask Discourse forum!

It should not be unstable, but anyway, did you take a look at Performance reports?
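
Here is a minimal sketch of how to use one (with a toy computation and a local Client standing in for your pipeline and cluster setup):

import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # stand-in for your HPC cluster setup

# The report (task stream, worker memory, bandwidth, ...) is written as a
# standalone HTML file when the block exits.
with performance_report(filename="dask-report.html"):
    da.random.random((20_000, 20_000), chunks=(2_000, 2_000)).mean().compute()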

Is it? What kind of instability do you see?

In addition to what you already pointed out, you might take a look at Fine performance metrics, but I’m not sure it contains what you want.

Another idea would be to use the Prometheus endpoint: with Grafana or a suitable client, you could probably get all the information you want there in real time.
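
As a rough sketch of a minimal client, assuming prometheus_client is installed on the scheduler (otherwise the /metrics route is not served) and that the dashboard listens on its default port 8787 (the address and exact metric names depend on your deployment):

import time
import urllib.request

METRICS_URL = "http://localhost:8787/metrics"  # hypothetical scheduler address

while True:
    body = urllib.request.urlopen(METRICS_URL, timeout=10).read().decode()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open("cluster_memory.log", "a") as f:
        # keep memory-related metric lines, drop Prometheus comment lines
        for line in body.splitlines():
            if "memory" in line and not line.startswith("#"):
                f.write(f"{stamp} {line}\n")
    time.sleep(30)

Since the file is appended to on every iteration, whatever was logged before an OOM kill is preserved.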

What HPC scheduler system are you using? There are usually APIs or modules to check and log this kind of resource usage during a job; did you ask the administrator of this system?
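
For example, if it is SLURM, something along these lines might work (a rough sketch; sstat field names, and whether you need to target the .batch step, vary between sites and SLURM versions):

import os
import subprocess
import time

# watch the batch step of the current job; adjust the step id as needed
step = os.environ["SLURM_JOB_ID"] + ".batch"

while True:
    out = subprocess.run(
        ["sstat", "-j", step, "--format=JobID,MaxRSS,AveRSS", "--noheader"],
        capture_output=True, text=True,
    ).stdout.strip()
    with open("slurm_memory.log", "a") as f:
        f.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {out}\n")
    time.sleep(60)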


Hi @guillaumeeb, thank you for this info!
The Performance reports give nice information, much better than MemorySampler.
However, both only output information on completion, so in case of an unexpected OOM error the Slurm job is killed and the output won’t be saved.

Regarding your questions:
We use Slurm, and I have asked, but our HPC team didn’t know of any way to measure things like memory usage.

The error message:

  File ".../lib/python3.12/site-packages/distributed/client.py", line 2403, in _gather
    raise exception.with_traceback(traceback)
distributed.client.FutureCancelledError: finalize-hlgfinalizecompute-a81d95fafb194637b56e32bd23a7f975 cancelled for reason: scheduler-connection-lost.
Client lost the connection to the scheduler. Please check your connection and re-run your work.

PS: I also often see this non-fatal error, which impacts monitoring (IP address censored):

2025-05-27 11:12:22,988 - distributed.scheduler - WARNING - Worker failed to heartbeat for 304s; removing: <WorkerState 'inproc://.../12162/15', name: 0, status: running, memory: 0, processing: 0>
2025-05-27 11:12:22,990 - distributed.scheduler - WARNING - Workers ['inproc://.../12162/15'] do not use a nanny and will be terminated without restarting them
2025-05-27 11:12:22,991 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'inproc://.../12162/15'.
2025-05-27 11:12:22,993 - distributed.worker - WARNING - Scheduler was unaware of this worker; shutting down.

I guess you could use some system command that you start at the beginning of each job, like dstat or an equivalent.
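
As an equivalent in Python, a small psutil-based logger launched in the background at the start of each job could do it (a minimal sketch, assuming psutil is available on the nodes):

import time
import psutil

# Append one line per interval; line-buffered so entries written before an
# OOM kill of the main computation are preserved.
with open("node_memory.log", "a", buffering=1) as f:
    while True:
        mem = psutil.virtual_memory()
        f.write(
            f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
            f"used={mem.used} available={mem.available} percent={mem.percent}\n"
        )
        time.sleep(30)

Note this logs per-node memory, not per-worker, but it keeps working even if the Dask cluster itself goes down.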

This might be several things: a network error, Scheduler or Client overload, or a lack of some resource.

This also looks like some network error, or an overloaded Scheduler or Worker…
