Tracking and Storing Memory Usage Per Task

Hi all,

I am looking for a way to track and store memory usage per task, including the peak, min, and max memory used by each task. Similar to how one can use performance_metric to store various metrics in an HTML summary file, I'd also like to keep track of the memory usage of my tasks.

I’ve come across the dask-memusage package, but it only works when the cluster's workers each run a single thread.
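
For context, this is roughly how I understand dask-memusage is used (based on its README), with the single-threaded-worker restriction that blocks me:

from dask.distributed import Client, LocalCluster
from dask_memusage import install

# dask-memusage requires every worker to run a single thread
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
install(cluster.scheduler, "memusage.csv")  # per-task min/max memory ends up in this CSV
client = Client(cluster)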

Are there any tools you guys recommend or any internal APIs I could use to achieve this?

Thank you!

Hi @mronda, welcome to Dask Discourse forum!

It is very hard to get memory usage per task if your Worker process has several threads, because at the system level you only get the memory usage of a whole process. Getting per-task memory would mean computing the memory footprint of the Python structures used by that task, which sounds hard.

So I’m not sure this is feasible in a multi-threaded context.
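
To illustrate, system-level tools such as psutil only report numbers for the whole process, so every task thread contributes to the same figure:

import psutil

# Resident memory of the whole Worker process: it mixes together what all
# task threads currently hold, so it cannot be attributed to a single task.
rss_bytes = psutil.Process().memory_info().rss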

You can have a look at Fine Performance Metrics — Dask.distributed 2024.7.1 documentation for an API to record metrics, but I still don’t know how to get the numbers you want.

cc @martindurant @crusaderky.

Hi @guillaumeeb ,

Thank you for your response! Will look into those docs right now.

What do you think about this approach? I wrote a quick SchedulerPlugin that periodically pulls data from scheduler.workers.values() and stores metrics like memory_percent in a file. My goal is to monitor total worker memory usage for each of my experiments, which will later help me fine-tune performance. I used this code for inspiration:

distributed/dashboard/components/scheduler.py#L4254
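
In rough outline, the plugin looks something like this (a simplified sketch, not my exact code; the MemoryLoggerPlugin name is just for illustration, and the WorkerState attribute names are what I believe the dashboard's WorkerTable uses):

import csv
import time

from distributed.diagnostics.plugin import SchedulerPlugin
from tornado.ioloop import PeriodicCallback


class MemoryLoggerPlugin(SchedulerPlugin):
    """Append per-worker memory metrics to a CSV file every `interval_ms`."""

    def __init__(self, scheduler, path, interval_ms=1000):
        self.scheduler = scheduler
        self.path = path
        self.pc = PeriodicCallback(self._dump, interval_ms)
        # start the periodic callback on the scheduler's event loop
        scheduler.loop.add_callback(self.pc.start)

    def _dump(self):
        now = time.time()
        with open(self.path, "a", newline="") as f:
            writer = csv.writer(f)
            for ws in self.scheduler.workers.values():
                # process memory as reported in worker heartbeats
                # (attribute names may differ between distributed versions)
                mem = ws.metrics.get("memory", 0)
                limit = ws.memory_limit or 0
                pct = 100 * mem / limit if limit else 0.0
                writer.writerow([now, ws.address, mem, limit, round(pct, 1)])


def install(scheduler, path):
    """Register the plugin on a running scheduler."""
    scheduler.add_plugin(MemoryLoggerPlugin(scheduler, path))

The install helper is what the dummy example below calls.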

Dummy example running:

# assumes `cluster` (a LocalCluster) and `client` already exist
import time

import numpy as np

install(cluster.scheduler, "experiment_memory.csv")

def test_function(i, size_in_gb=2):
    # allocate ~size_in_gb GiB of float64 data and hold it for a while
    elements = (size_in_gb * (1024 ** 3)) // np.dtype(np.float64).itemsize
    large_array = np.random.random(elements)
    time.sleep(10)
    return i * 2

futures = [client.submit(test_function, i, size_in_gb=2, pure=False) for i in range(8)]
results = client.gather(futures)

Creates a file like this:
[screenshot of the resulting CSV]

Curious, is there a more efficient way to pull data from the WorkerTable and store it somewhere?

Thanks again for taking a look at this! :slight_smile:

This sounds like a nice solution. Those metrics are also already exposed through a Prometheus endpoint: Prometheus monitoring — Dask.distributed 2024.7.1+10.gc44ad22 documentation.

cc @jacobtomlinson

Hi @guillaumeeb, thank you! Are there any examples of pulling metrics from the Prometheus endpoint? And maybe for another topic (I can ask in a separate post): can I publish custom metrics to the Prometheus endpoint? I've always wanted to publish my own custom metrics to visualize later. I asked about something like this a while ago but got nowhere :smiley:

Thanks!

I will let @jacobtomlinson answer those last questions, because I’m not sure of the answers.

Getting back to memory profiling, I just became aware of the dask memray plugin in another recent post!


Prometheus endpoints are just HTTP endpoints with a specific format. Of course the most common way to collect that data is with Prometheus itself, but you could also just periodically curl the endpoint and store the data however you like.

Make sure you have the prometheus_client package installed and then access <dask scheduler IP>:8787/metrics.
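
For example (an untested sketch, assuming the default dashboard port and that you have requests installed alongside prometheus_client), polling and parsing the endpoint could look like:

import time

import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8787/metrics"  # replace with <dask scheduler IP>:8787/metrics

def poll_once():
    """Fetch the endpoint once and return (timestamp, name, labels, value) tuples."""
    text = requests.get(METRICS_URL).text
    now = time.time()
    rows = []
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            rows.append((now, sample.name, dict(sample.labels), sample.value))
    return rows

# e.g. keep only memory-related samples and append them to your CSV
memory_rows = [row for row in poll_once() if "memory" in row[1]]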

You could add more metrics via a scheduler plugin. We don’t have any documentation on this, so you will need to look at how the current metrics are implemented.
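
As a rough sketch (assuming the scheduler's /metrics endpoint serves prometheus_client's default registry, which is how I believe the built-in metrics are wired up), a scheduler plugin could register and update its own Gauge:

from prometheus_client import Gauge
from distributed.diagnostics.plugin import SchedulerPlugin

# hypothetical custom metric
TASK_COUNT = Gauge("my_experiment_scheduler_tasks", "Number of tasks known to the scheduler")

class CustomMetricsPlugin(SchedulerPlugin):
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def transition(self, key, start, finish, *args, **kwargs):
        # update the gauge on every task transition
        TASK_COUNT.set(len(self.scheduler.tasks))

# scheduler.add_plugin(CustomMetricsPlugin(scheduler))

Metrics registered this way should then show up on the same /metrics endpoint as the built-in ones.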
