Why is distributed leaking memory from unused futures?

@SultanOrazbayev Saving a reference to the future and manually deleting it works for me (sketched below). In general, though, Python doesn't require users to free unnamed objects, and I think Dask should take care of releasing futures that the client can never use again. The memory management documentation linked above says:

The result of a task is kept in memory if either of the following conditions hold:

  1. A client holds a future pointing to this task. The data should stay in RAM so that the client can gather the data on demand.
  2. The task is necessary for ongoing computations that are working to produce the final results pointed to by futures. These tasks will be removed once no ongoing tasks require them.

But I would not expect the client to hold any future pointing to the unnamed and unused result of client.submit(lambda df: df + 1, df1_future). Why should the user be responsible for memory management of a future that they cannot (as far as I can tell) access?
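
For concreteness, here is roughly what I mean, together with the manual workaround that does free the memory. This is only a sketch: the local Client, the scattered df1_future, and the name fut are just for illustration.

import numpy as np
import pandas as pd
from dask.distributed import Client

client = Client()  # local cluster, just for illustration

df1 = pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8)))
df1_future = client.scatter(df1)

# The pattern I'm asking about: I never keep a handle on the returned
# future, so as far as I can tell I have no way to release its result,
# yet in my case the worker keeps that result in memory.
print(client.submit(lambda df: df + 1, df1_future).result())

# The workaround mentioned above: keep a named reference and drop it
# explicitly once the result is no longer needed.
fut = client.submit(lambda df: df + 1, df1_future)
print(fut.result())
del fut  # or fut.release(), or client.cancel(fut)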

Using temporary, unnamed objects like this is a very common pattern in interactive work. For example, in a notebook cell I might have:

import pandas as pd
import numpy as np

# ~2 GiB of int64 data (2**20 rows x 2**8 columns x 8 bytes)
df = pd.DataFrame(np.random.randint(0, 100, size=(2**20, 2**8)))
print(df + 1)

Python constructs the dataframe for df + 1, prints it, and then takes care of deleting the anonymous result. Every time I re-run the print line, Python's memory usage on my Mac goes up by about 2 GB and then drops back down immediately afterwards.
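
In case it helps, this is how I have been checking what the cluster still holds after the anonymous submit (again just a sketch, using the same client as above; dask_worker.data is each worker's mapping of task keys to in-memory results):

# List the task keys each worker is still holding in memory.
held = client.run(lambda dask_worker: list(dask_worker.data))
print(held)
# In my case the key for the anonymous lambda task is still listed here,
# even though nothing in my code can reach its future any more.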