Hi, as part of debugging large Dask image operations running on HPC, I’m looking for a way to log (cluster) memory use in real time.
I’ve tried a number of options using the functionality provided by Dask:
The Dask dashboard (via SSH tunnelling): this is really nice, but it seems unstable and times out at some point, and I don’t want to watch it manually over many hours. It doesn’t help that the tunnel needs two jumps.
MemorySampler: this logs exactly what I need, but it seems unstable over long runs, sometimes failing the entire HPC job. Moreover, if an unexpected OOM occurs, the results are never stored (a sketch of how I’m using it follows below these options).
Exposing more logging by lowering the ‘silence’ level to INFO: I haven’t found any way to get memory information this way.
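For reference, this is roughly how I’m using MemorySampler, with a toy dask.array computation standing in for the real image pipeline. Because the samples are only retrieved after the context manager exits, an OOM mid-run means nothing gets saved:

```python
# Minimal sketch of my MemorySampler usage; the samples only become available
# via to_pandas() after the context manager exits, so nothing is persisted if
# the job dies with an OOM mid-run.
import dask.array as da
from distributed import Client
from distributed.diagnostics import MemorySampler

client = Client()  # in practice: the SLURM-backed cluster
ms = MemorySampler()

# Toy computation standing in for the real image pipeline.
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

with ms.sample("image-pipeline", interval=1):
    x.mean().compute()

ms.to_pandas().to_csv("memory_samples.csv")  # only reached if the run completes
```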
So ideally I’m looking for something similar to what MemorySampler does, periodically monitoring memory use, but writing it out in real time, e.g. appending entries to a text file periodically, ideally without having to create a separate thread that explicitly queries this.
Does this functionality exist? What would be the best way to do this?
I’ve gone through a lot of documentation and blogs, also specifically on HPC, but was not able to find this, so any help is much appreciated.
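For concreteness, the closest workaround I can come up with myself is a small client-side poller built on the public client.scheduler_info() call, appending per-worker memory to a CSV; but this needs its own polling thread, which is exactly what I was hoping to avoid (hypothetical sketch):

```python
# Hypothetical sketch: poll the scheduler from the client and append per-worker
# memory (bytes) to a CSV file. Uses only the public client.scheduler_info()
# call, but requires the extra polling thread I'd prefer to avoid.
import csv
import threading
import time

from distributed import Client


def log_worker_memory(client: Client, path: str, interval: float = 5.0) -> threading.Event:
    """Start a daemon thread that appends (timestamp, worker, memory) rows to `path`."""
    stop = threading.Event()

    def loop() -> None:
        with open(path, "a", newline="") as f:
            writer = csv.writer(f)
            while not stop.is_set():
                info = client.scheduler_info()  # includes per-worker metrics
                now = time.time()
                for addr, worker in info["workers"].items():
                    writer.writerow([now, addr, worker["metrics"]["memory"]])
                f.flush()  # make rows visible immediately, even if the job is killed
                stop.wait(interval)

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

You would call stop = log_worker_memory(client, "memory.csv") before the compute and stop.set() afterwards.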
In addition to what you already mentioned, you might take a look at Fine Performance Metrics, though I’m not sure it has what you want.
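If I remember correctly, they accumulate on the scheduler and can be pulled from the client, something like this untested sketch (assuming they live in Scheduler.cumulative_worker_metrics):

```python
# Untested sketch: pull fine performance metrics from the scheduler.
# Assumes they accumulate in Scheduler.cumulative_worker_metrics (a plain dict).
from distributed import Client

client = Client()  # or connect to your existing cluster

metrics = client.run_on_scheduler(
    lambda dask_scheduler: dict(dask_scheduler.cumulative_worker_metrics)
)
for key, value in metrics.items():
    print(key, value)
```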
Another idea would be to use the Prometheus endpoint: with Grafana or a suitable client, you could probably get all the information you want in real time.
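The scheduler serves a Prometheus text endpoint on the dashboard port at /metrics (when prometheus_client is installed), so even without Grafana you could append the memory-related lines to a file, roughly like this (sketch; which metric names to keep is a guess, hence the broad filter):

```python
# Rough sketch: poll the scheduler's Prometheus endpoint and append
# memory-related metric lines to a text file. The broad "memory" filter is a
# guess at which metrics are relevant.
import time
import urllib.request

METRICS_URL = "http://localhost:8787/metrics"  # adjust to your scheduler host/port

while True:
    body = urllib.request.urlopen(METRICS_URL, timeout=10).read().decode()
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    with open("dask_memory_metrics.txt", "a") as f:
        for line in body.splitlines():
            if "memory" in line and not line.startswith("#"):
                f.write(f"{stamp} {line}\n")
    time.sleep(30)
```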
What HPC scheduler are you using? There are usually APIs or modules to check and log this kind of resource usage alongside a job; have you asked the administrators of your system?
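For instance, if it is Slurm, sstat can usually report memory for a running job step (and sacct after the fact); a rough, untested sketch of polling it from inside the job:

```python
# Rough, untested sketch: poll Slurm's sstat for the batch step's memory usage
# and append it to a file. Field names and flags may differ between Slurm versions.
import os
import subprocess
import time

job_id = os.environ["SLURM_JOB_ID"]  # set by Slurm inside the job

while True:
    out = subprocess.run(
        ["sstat", "-j", f"{job_id}.batch", "--format=MaxRSS,AveRSS",
         "--noheader", "--parsable2"],
        capture_output=True, text=True, check=False,
    ).stdout.strip()
    with open("slurm_memory.log", "a") as f:
        f.write(f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {out}\n")
    time.sleep(60)
```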
Hi @guillaumeeb , thank you for this info!
The performance reports give nice information, much better than MemorySampler.
However, both only write their output on completion, so if an unexpected OOM occurs, the Slurm job is killed and nothing is saved.
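For reference, this is roughly how I’m generating the report (toy computation again); the HTML file is only written when the context manager exits, which is exactly the problem:

```python
# Sketch of how I'm generating the performance report; the HTML file is only
# written when the context manager exits, so a mid-run OOM leaves nothing behind.
import dask.array as da
from dask.distributed import Client, performance_report

client = Client()  # in practice: the SLURM-backed cluster

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))

with performance_report(filename="dask-report.html"):
    x.mean().compute()
```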
Regarding your questions:
We use Slurm, and I have asked, but our HPC team didn’t know of any way to measure e.g. per-job memory.
The error message:
File ".../lib/python3.12/site-packages/distributed/client.py", line 2403, in _gather
raise exception.with_traceback(traceback)
distributed.client.FutureCancelledError: finalize-hlgfinalizecompute-a81d95fafb194637b56e32bd23a7f975 cancelled for reason: scheduler-connection-lost.
Client lost the connection to the scheduler. Please check your connection and re-run your work.
PS: I also often see these non-fatal warnings, which impact monitoring (IP addresses censored):
2025-05-27 11:12:22,988 - distributed.scheduler - WARNING - Worker failed to heartbeat for 304s; removing: <WorkerState 'inproc://.../12162/15', name: 0, status: running, memory: 0, processing: 0>
2025-05-27 11:12:22,990 - distributed.scheduler - WARNING - Workers ['inproc://.../12162/15'] do not use a nanny and will be terminated without restarting them
2025-05-27 11:12:22,991 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'inproc://.../12162/15'.
2025-05-27 11:12:22,993 - distributed.worker - WARNING - Scheduler was unaware of this worker; shutting down.
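The 304 s in the first warning is just past what I believe is the default distributed.scheduler.worker-ttl of 5 minutes, so one thing I’m considering is raising that limit for these long-running tasks (untested sketch):

```python
# Untested sketch: raise the scheduler's worker-ttl so that busy workers which
# miss heartbeats for a few minutes are not removed. I believe the default is
# "5 minutes" (300 s), which would match the 304 s in the warning above.
# Must be set in the process that creates the scheduler, before the cluster starts.
import dask

dask.config.set({"distributed.scheduler.worker-ttl": "30 minutes"})
```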