Persistent memory profiling/logging

Hi, as part of debugging large Dask image operations run on HPC, I’m looking for a way to do real-time logging of (cluster) memory.

I’ve tried a number of options using the functionality provided by Dask:

  • The Dask dashboard (via SSH tunnelling): this is really nice, but it seems unstable and times out at some point, and I don’t want to keep watching it over many hours. It doesn’t help that the tunnel needs two jumps.
  • MemorySampler: this logs exactly what I need, but it seems unstable over long runs, failing the entire HPC job. Moreover, if an unexpected OOM occurs, the results are not stored. (Typical usage is sketched after this list.)
  • Exposing logging by lowering the ‘silence’ level to info: I haven’t found any way to get memory information this way.
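
For reference, this is roughly how I’m using MemorySampler (a minimal sketch; the array is just a placeholder for the actual image pipeline):

    from distributed import Client
    from distributed.diagnostics import MemorySampler
    import dask.array as da

    client = Client()  # in my case a cluster started inside the HPC job
    ms = MemorySampler()

    # Placeholder computation; the real workload is a large image pipeline
    x = da.random.random((50_000, 50_000), chunks=(2_000, 2_000))

    with ms.sample("run"):
        x.mean().compute()

    # Only written once the sampled block finishes; nothing is saved if the
    # job dies with an OOM part-way through.
    ms.to_pandas().to_csv("memory_log.csv")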

So ideally I’m looking for something similar to what MemorySampler does (periodically monitoring memory use), but writing the output in real time, e.g. appending entries to a text file, ideally without having to create a separate thread that explicitly polls for this.
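
For concreteness, the kind of thing I have in mind would be a scheduler-side hook that appends a line to a file every few seconds, roughly like the sketch below. This is untested and based on my assumptions: that a SchedulerPlugin’s start() can set up a PeriodicCallback on the scheduler’s event loop, and that per-worker process memory is reachable via the worker state (attribute names may well be wrong).

    import time

    from distributed.diagnostics.plugin import SchedulerPlugin
    from tornado.ioloop import PeriodicCallback


    class MemoryLogger(SchedulerPlugin):
        """Append aggregate cluster memory to a text file every few seconds."""

        def __init__(self, path="memory_log.txt", interval_ms=10_000):
            self.path = path
            self.interval_ms = interval_ms
            self.pc = None

        async def start(self, scheduler):
            def _log():
                # Sum per-worker process memory as reported by heartbeats
                # (assumption: WorkerState.memory.process exists).
                total = sum(ws.memory.process for ws in scheduler.workers.values())
                with open(self.path, "a") as f:
                    f.write(f"{time.time()},{total}\n")

            self.pc = PeriodicCallback(_log, self.interval_ms)
            self.pc.start()

        async def close(self):
            if self.pc is not None:
                self.pc.stop()

    # client.register_plugin(MemoryLogger())   # register_scheduler_plugin() on older versions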

Does this functionality exist? What would be the best way to do this?
I’ve gone through a lot of documentation and blogs, also specifically on HPC, but was not able to find this, so any help is much appreciated.

Hi @folterj, welcome to Dask Discourse forum!

The dashboard should not be unstable, but anyway, did you take a look at Performance reports?
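
For example (a minimal sketch; my_computation is a placeholder for your workload, and the HTML report is only written when the context manager exits):

    from dask.distributed import Client, performance_report

    client = Client()  # or your HPC cluster

    with performance_report(filename="dask-report.html"):
        result = my_computation.compute()  # placeholder for your workload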

Is it? What kind of instability do you see?

In addition to what you already pointed out, you might take a look at Fine performance metrics, but I’m not sure it has what you want.

Another idea would be to use the Prometheus endpoint: with Grafana or a suitable client, you could probably get all the information you want in real time.
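
For instance, the Scheduler’s HTTP server exposes metrics in Prometheus format at the /metrics endpoint when prometheus_client is installed; which memory metrics appear on the scheduler versus the individual workers may depend on your distributed version, but a quick way to check is to poll it yourself:

    import requests

    # Metrics live on the same HTTP server as the dashboard
    # (port 8787 by default; adjust host/port for your deployment).
    resp = requests.get("http://localhost:8787/metrics", timeout=5)
    for line in resp.text.splitlines():
        if "memory" in line and not line.startswith("#"):
            print(line)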

What HPC scheduler system are you using? There are usually APIs or modules for checking and logging this kind of resource usage alongside a job; did you ask the administrators of this system?


Hi @guillaumeeb , thank you for this info!
The Performance reports give nice information, much better than MemorySampler.
However, both only write their output on completion, so in case of an unexpected OOM error the Slurm job is killed and the output won’t be saved.

Regarding your questions:
We use Slurm, and I have asked, but our HPC team didn’t know of any way to measure resources such as memory.

The error message:

  File ".../lib/python3.12/site-packages/distributed/client.py", line 2403, in _gather
    raise exception.with_traceback(traceback)
distributed.client.FutureCancelledError: finalize-hlgfinalizecompute-a81d95fafb194637b56e32bd23a7f975 cancelled for reason: scheduler-connection-lost.
Client lost the connection to the scheduler. Please check your connection and re-run your work.

PS: I also often see this non-fatal error, impacting monitoring (censored IP address):

2025-05-27 11:12:22,988 - distributed.scheduler - WARNING - Worker failed to heartbeat for 304s; removing: <WorkerState 'inproc://.../12162/15', name: 0, status: running, memory: 0, processing: 0>
2025-05-27 11:12:22,990 - distributed.scheduler - WARNING - Workers ['inproc://.../12162/15'] do not use a nanny and will be terminated without restarting them
2025-05-27 11:12:22,991 - distributed.scheduler - WARNING - Received heartbeat from unregistered worker 'inproc://.../12162/15'.
2025-05-27 11:12:22,993 - distributed.worker - WARNING - Scheduler was unaware of this worker; shutting down.

I guess you could use some system commands that you could start at the beginning of each job, like dstat or equivalent.
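
If you would rather stay in Python than wrap dstat, a small psutil loop launched as a background step of the Slurm job would do roughly the same thing per node (my own sketch, not a Dask feature):

    # log_memory.py -- start in the job script with: python log_memory.py memlog.txt &
    import sys
    import time

    import psutil

    path = sys.argv[1] if len(sys.argv) > 1 else "memlog.txt"

    while True:
        vm = psutil.virtual_memory()
        with open(path, "a") as f:
            f.write(f"{time.time()},{vm.used},{vm.available}\n")
        time.sleep(10)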

This might be several things: a network error, Scheduler or Client overload, or some other lack of resources.

This also looks like some network error, or an overloaded Scheduler or Worker…


Thanks for the feedback on this. The scheduler/connection issues were indeed most likely caused by overloading the Client/Cluster scheduler through incorrect loading of data into Dask.

Learnings for this use-case:

  • Correctly loading data into Dask is of critical importance.
  • Don’t use Client/Cluster. The default threaded Dask scheduler appears to perform significantly better than a Client/Cluster with equivalent threaded single-process settings: testing on a small dataset (~18 GB) on HPC, the default Dask scheduler was three times as fast (wall time) as Client/Cluster, in line with the significantly higher CPU core utilisation observed. (Switching between the two setups is sketched after this list.)
  • Not using a Cluster unfortunately means the Dask profiling is not accessible. An alternative for basic monitoring or troubleshooting is to use standard OS-level diagnostics like htop. I’m not sure what the mentioned Prometheus endpoint involves, or whether it also relies on a Cluster scheduler.
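
For completeness, switching between the two setups looks roughly like this (a sketch only; the array and the LocalCluster arguments are illustrative, not my actual configuration):

    import dask.array as da
    from distributed import Client, LocalCluster

    x = da.random.random((50_000, 50_000), chunks=(2_000, 2_000))  # placeholder workload

    # Option A: default local threaded scheduler (what worked best for me)
    result_a = x.mean().compute(scheduler="threads")

    # Option B: Client/Cluster with equivalent threaded, single-process settings
    cluster = LocalCluster(n_workers=1, threads_per_worker=32, processes=False)
    client = Client(cluster)
    result_b = x.mean().compute()  # picks up the active Client
    client.close()
    cluster.close()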

Sure! Data chunking is not straightforward but really important; see Choosing good chunk sizes in Dask.
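
As a trivial illustration (the zarr path and chunk numbers are purely hypothetical), the chunking is usually decided at load time:

    import dask.array as da

    # Pick chunks so each task works on a manageable block (here ~32 MB of
    # uint16 data) rather than whole planes or the full volume.
    image = da.from_zarr("my_image.zarr", chunks=(1, 4096, 4096))
    print(image.chunksize, image.npartitions)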

This is definitely not a general conclusion, and it is even really strange. In some cases the Distributed/multiprocessing schedulers can be slower because they add the burden of data serialization between processes and so on, but 3 times looks really high to me! Maybe you should analyze your workflow more closely: how much time does your processing take? Are you really loading only the data you need in the multiprocessing case?

It does rely on Worker processes. However, you do have some diagnostics with the local scheduler: Diagnostics (local) — Dask documentation.
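
For example, the local ResourceProfiler samples memory and CPU at a fixed interval (a minimal sketch; dt is the sampling period in seconds, and the array is a placeholder):

    import dask.array as da
    from dask.diagnostics import ResourceProfiler

    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))  # placeholder workload

    with ResourceProfiler(dt=1) as rprof:
        x.mean().compute(scheduler="threads")

    # rprof.results is a list of (time, mem, cpu) samples;
    # rprof.visualize() renders the same data as a Bokeh plot.
    for sample in rprof.results[:5]:
        print(sample)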

Thank you for the additional insights.
Agreed that such a large difference (at least in wall time) seems unexpected; as mentioned, this was in the context of my particular case.

Coming back to the original question, thank you also for the local scheduler diagnostics reference. Although this still doesn’t provide real-time logging (i.e. output surviving a crash), the ResourceProfiler (Diagnostics (local) — Dask documentation) looks really good for this, so I’ve marked it as the solution.
Thank you again.
