Tracking and Storing Memory Usage Per Task

Hi all,

I am looking for a way to track and store memory usage per task, including the peak, min, and max memory used by each task. Similar to how one can use performance_metric to store various metrics in an HTML summary file, I'd also like to keep track of the memory usage of my tasks.

I’ve come across the dask-memusage package, but it only works when the cluster's workers each run a single thread.
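
For context, this is roughly how I understand dask-memusage is used (based on its README), with the single-threaded-worker restriction that blocks me:

from dask.distributed import Client, LocalCluster
from dask_memusage import install

# dask-memusage requires every worker to run a single thread
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
install(cluster.scheduler, "memusage.csv")  # per-task min/max memory ends up in this CSV
client = Client(cluster)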

Are there any tools you guys recommend or any internal APIs I could use to achieve this?

Thank you!

Hi @mronda, welcome to Dask Discourse forum!

It is very hard to get memory usage per task if your Worker process has several threads, because at the system level you only get the memory usage of a whole process. Getting per-task memory would mean computing the memory footprint of the Python structures used by that task, which sounds hard.

So I’m not sure this is feasible in a multi-threaded context.
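
To illustrate, system-level tools such as psutil only report numbers for the whole process, so every task thread contributes to the same figure:

import psutil

# Resident memory of the whole Worker process: it mixes together what all
# task threads currently hold, so it cannot be attributed to a single task.
rss_bytes = psutil.Process().memory_info().rss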

You can have a look at Fine Performance Metrics — Dask.distributed 2024.7.1 documentation for an API to record metrics, but I still don’t know how to get the numbers you want.

cc @martindurant @crusaderky.

Hi @guillaumeeb ,

Thank you for your response! Will look into those docs right now.

What do you think about this approach? I wrote a quick SchedulerPlugin that periodically pulls data from scheduler.workers.values() and stores metrics like memory_percent in a file. My goal is to monitor total worker memory usage for each of my experiments, which will later help me fine-tune performance. I used this code for inspiration:

distributed/dashboard/components/scheduler.py#L4254
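
In rough outline, the plugin looks something like this (a simplified sketch, not my exact code; the MemoryLoggerPlugin name is just for illustration, and the WorkerState attribute names are what I believe the dashboard's WorkerTable uses):

import csv
import time

from distributed.diagnostics.plugin import SchedulerPlugin
from tornado.ioloop import PeriodicCallback


class MemoryLoggerPlugin(SchedulerPlugin):
    """Append per-worker memory metrics to a CSV file every `interval_ms`."""

    def __init__(self, scheduler, path, interval_ms=1000):
        self.scheduler = scheduler
        self.path = path
        self.pc = PeriodicCallback(self._dump, interval_ms)
        # start the periodic callback on the scheduler's event loop
        scheduler.loop.add_callback(self.pc.start)

    def _dump(self):
        now = time.time()
        with open(self.path, "a", newline="") as f:
            writer = csv.writer(f)
            for ws in self.scheduler.workers.values():
                # process memory as reported in worker heartbeats
                # (attribute names may differ between distributed versions)
                mem = ws.metrics.get("memory", 0)
                limit = ws.memory_limit or 0
                pct = 100 * mem / limit if limit else 0.0
                writer.writerow([now, ws.address, mem, limit, round(pct, 1)])


def install(scheduler, path):
    """Register the plugin on a running scheduler."""
    scheduler.add_plugin(MemoryLoggerPlugin(scheduler, path))

The install helper is what the dummy example below calls.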

Dummy example running:

# assumes `cluster` (a LocalCluster) and `client` already exist
import time

import numpy as np

install(cluster.scheduler, "experiment_memory.csv")

def test_function(i, size_in_gb=2):
    # allocate ~size_in_gb GiB of float64 data and hold it for a while
    elements = (size_in_gb * (1024 ** 3)) // np.dtype(np.float64).itemsize
    large_array = np.random.random(elements)
    time.sleep(10)
    return i * 2

futures = [client.submit(test_function, i, size_in_gb=2, pure=False) for i in range(8)]
results = client.gather(futures)

Creates a file like this:
[screenshot of the resulting CSV]

Curious, is there a more efficient way to pull data from the WorkerTable and store it somewhere?

Thanks again for taking a look at this! :slight_smile:

This sounds like a nice solution. Those metrics are also already exposed through a Prometheus endpoint: Prometheus monitoring — Dask.distributed 2024.7.1+10.gc44ad22 documentation.

cc @jacobtomlinson

Hi @guillaumeeb, thank you! Are there any examples of pulling metrics from the Prometheus endpoint? And maybe for another topic (I can ask in a separate post): can I publish custom metrics to the Prometheus endpoint? I've always wanted to publish my own custom metrics to visualize later. I asked about something like this a while ago but got nowhere :smiley:

Thanks!

I will let @jacobtomlinson answer those last questions, because I’m not sure of the answers.

Getting back to memory profiling, I just became aware of the dask memray plugin in another recent post!


Prometheus endpoints are just HTTP endpoints with a specific format. Of course the most common way to collect that data is with Prometheus itself, but you could also just periodically curl the endpoint and store the data however you like.

Make sure you have the prometheus_client package installed and then access <dask scheduler IP>:8787/metrics.
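
For example (an untested sketch, assuming the default dashboard port and that you have requests installed alongside prometheus_client), polling and parsing the endpoint could look like:

import time

import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8787/metrics"  # replace with <dask scheduler IP>:8787/metrics

def poll_once():
    """Fetch the endpoint once and return (timestamp, name, labels, value) tuples."""
    text = requests.get(METRICS_URL).text
    now = time.time()
    rows = []
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            rows.append((now, sample.name, dict(sample.labels), sample.value))
    return rows

# e.g. keep only memory-related samples and append them to your CSV
memory_rows = [row for row in poll_once() if "memory" in row[1]]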

You could add more metrics via a scheduler plugin. We don’t have any documentation on this, so you will need to look at how the current metrics are implemented.
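
As a rough sketch (assuming the scheduler's /metrics endpoint serves prometheus_client's default registry, which is how I believe the built-in metrics are wired up), a scheduler plugin could register and update its own Gauge:

from prometheus_client import Gauge
from distributed.diagnostics.plugin import SchedulerPlugin

# hypothetical custom metric
TASK_COUNT = Gauge("my_experiment_scheduler_tasks", "Number of tasks known to the scheduler")

class CustomMetricsPlugin(SchedulerPlugin):
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def transition(self, key, start, finish, *args, **kwargs):
        # update the gauge on every task transition
        TASK_COUNT.set(len(self.scheduler.tasks))

# scheduler.add_plugin(CustomMetricsPlugin(scheduler))

Metrics registered this way should then show up on the same /metrics endpoint as the built-in ones.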
