Memory leak in Dask cluster

Hi guys! I would like some advice about Dask and memory leaks. My problem is that after all the tasks are done, memory usage on the workers remains really high.

Let me attach a graph of the memory usage. It shows the total memory across all the workers (9 in total): the peak was 58.9 GiB, and after the tasks were done it only went down to 42.2 GiB.
It doesn't get any lower when I start my next workflow; it just keeps stacking up. I don't have any Parquet writes or persist calls in use.

Hi @patrik93, welcome here!

Did you have a look at the Dashboard to see how your cluster memory is used?

I suspect this has to do with unmanaged memory. There are some nice resources about it:
https://distributed.dask.org/en/stable/worker-memory.html#using-the-dashboard-to-monitor-memory-usage
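If it helps to cross-check the dashboard numbers, here is a small sketch for reading each worker's RSS from the client side (assuming psutil, which distributed already depends on; the scheduler address is a placeholder):

```python
import psutil
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address, use your own

def rss_gib() -> float:
    # Resident set size of this worker process, in GiB. This is what the OS
    # sees, so the gap between this and Dask's managed memory is "unmanaged".
    return psutil.Process().memory_info().rss / 2**30

# client.run() executes the function on every worker and returns a
# {worker_address: value} mapping.
print(client.run(rss_gib))
```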

This is not necessarily a problem: normally, Dask will be able to free some of this unused memory when the next job starts.

Could you produce a minimal reproducible example? Or at least share some snippets of your workflow?

Thank you for the quick response. Let me share a screenshot of the dashboard; there you can see it is indeed unmanaged memory. Previously we had some manual memory trims in place, but when we updated to dask == 2023.3.1 and distributed == 2023.3.1 we removed them, because we interpreted the documentation as saying that manual trimming is no longer recommended and is now handled by Dask. That's how we ended up here.

I'm afraid I cannot give you a reproducible example at the moment. What I can promise is that we will try to add the memory trims back into the crucial parts of the workflow.
Can you confirm that using them does not lead to any problems?

Reading the documentation here:
https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os

It does say that manual trimming

should be only used as a one-off debugging experiment

But it also gives another way to trim automatically, if this is indeed your problem.
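For reference, the manual trim described on that page boils down to calling glibc's malloc_trim on every worker; a minimal sketch (Linux/glibc only, with a placeholder scheduler address) looks like this:

```python
import ctypes

from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder address, use your own

def trim_memory() -> int:
    # Ask glibc to give free heap pages back to the OS (Linux/glibc only).
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

# One-off trim on every worker, as suggested on the documentation page above.
client.run(trim_memory)
```

For the automatic route, the same page points at the MALLOC_TRIM_THRESHOLD_ environment variable; note that it has to be set in the worker's environment before the process starts (e.g. via the nanny), not from inside a running task.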

Hi there,

It took a while for me to get back to you with the results. We tried applying manual trims at the end of our workflow, but it did not help either; the memory stays there.

I’m sorry to hear that.

What happens if you run the workflow with less memory on the cluster? Maybe Python just doesn't release the memory because it doesn't need to?
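If you want to try that, one way is to cap the per-worker memory when creating the cluster; a quick sketch with LocalCluster (the worker count and limit are placeholders to adapt to your deployment):

```python
from dask.distributed import Client, LocalCluster

# Placeholder worker count and per-worker limit; adapt to your deployment.
# A lower memory_limit makes workers spill/pause earlier, which helps show
# whether the high resident memory is actually needed by the workflow.
cluster = LocalCluster(n_workers=9, threads_per_worker=2, memory_limit="4GiB")
client = Client(cluster)
```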

Otherwise, without a reproducer, I'm not sure how we can help.