Memory leak in Dask cluster

Hi guys! I would like some advice about Dask and memory leaks. My problem is that after all the tasks are done, memory usage on the workers remains really high.

Let me attach a graph of the memory usage. It shows the total memory across all the workers (9 in total): the peak was 58.9 GiB, and after the tasks were done it only went down to 42.2 GiB.
It doesn't get any lower when I start my next workflow; it just keeps stacking up. I don't have any Parquet writes or persist calls in use.

Hi @patrik93, welcome here!

Did you have a look at the Dashboard to see how your cluster memory is used?

I suspect this has to do with unmanaged memory. There are some nice resources about it:
https://distributed.dask.org/en/stable/worker-memory.html#using-the-dashboard-to-monitor-memory-usage
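If it helps to cross-check the dashboard numbers, here is a small sketch for reading each worker's RSS from the client side (assuming psutil, which distributed already depends on; the scheduler address is a placeholder):

```python
import psutil
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address, use your own

def rss_gib() -> float:
    # Resident set size of this worker process, in GiB. This is what the OS
    # sees, so the gap between this and Dask's managed memory is "unmanaged".
    return psutil.Process().memory_info().rss / 2**30

# client.run() executes the function on every worker and returns a
# {worker_address: value} mapping.
print(client.run(rss_gib))
```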

This is not necessarily a problem: normally, Dask will be able to free some of this unused memory when the next job starts.

Could you produce a minimal reproducible example? Or at least share some snippets of your workflow?

Thank you for the quick response. Let me share a screenshot of the dashboard; there you can see it is indeed unmanaged memory. Previously we had some manual memory trims in place, but when we updated to dask == 2023.3.1 and distributed == 2023.3.1 we removed them, because we interpreted the documentation as saying that manual trimming is no longer recommended and is now handled by Dask. That's how we ended up here.

I'm afraid I cannot give you a reproducible example at the moment. What I can promise is that we will try to add the memory trims back into the crucial parts of the workflow.
Can you confirm that using them does not lead to any problems?

Reading the documentation here:
https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os

It does say that manual trimming

should be only used as a one-off debugging experiment

But it also gives another way to trim automatically, if this is indeed your problem.
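For reference, the manual trim described on that page boils down to calling glibc's malloc_trim on every worker; a minimal sketch (Linux/glibc only, with a placeholder scheduler address) looks like this:

```python
import ctypes

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address, use your own

def trim_memory() -> int:
    # Ask glibc to give free heap pages back to the OS (Linux/glibc only).
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# One-off trim on every worker, as suggested on the documentation page above.
client.run(trim_memory)
```

For the automatic route, the same page points at the MALLOC_TRIM_THRESHOLD_ environment variable; note that it has to be set in the worker's environment before the process starts (e.g. via the nanny), not from inside a running task.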

Hi there,

It took a while for me to get back to you with the results. We tried applying manual trims at the end of our workflow, but it did not help either; the memory stays there.

I’m sorry to hear that.

What happens if you run the workflow with less memory on the cluster? Maybe Python just doesn't release the memory because it doesn't need to?
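If you want to try that, one way is to cap the per-worker memory when creating the cluster; a quick sketch with LocalCluster (the worker count and limit are placeholders to adapt to your deployment):

```python
from dask.distributed import Client, LocalCluster

# Placeholder worker count and per-worker limit; adapt to your deployment.
# A lower memory_limit makes workers spill/pause earlier, which helps show
# whether the high resident memory is actually needed by the workflow.
cluster = LocalCluster(n_workers=9, threads_per_worker=2, memory_limit="4GiB")
client = Client(cluster)
```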

Otherwise, without a reproducer, I'm not sure how we can help.