Memory leak in delayed functions when using a pandas DataFrame

Hi everyone

I am using Dask Distributed and ran into what looks like a memory leak when using a pandas DataFrame inside a Dask delayed function.

I created some sample delayed functions to illustrate the problem.

I created two delayed functions and a trim method:

  • One function reads a file and appends its rows to a list; at the end it prints the length of the list
  • The second function does the same, but uses pandas

First function

from dask import delayed

@delayed
def process_file(filename):
    # Read the file line by line and collect the rows in a plain Python list
    lines = []
    with open(filename) as f:
        for row in f:
            lines.append(row)
    print(len(lines))

Second function

import pandas as pd
from dask import delayed

@delayed
def process(filename):
    # Read the file with pandas and collect (index, Series) pairs in a list
    df = pd.read_parquet(filename)
    rows = []
    df.info()  # info() prints directly and returns None
    for row in df.iterrows():
        rows.append(row)
    print(len(rows))

Trim method

import ctypes

def trim_memory() -> int:
    # Ask glibc to return freed heap memory back to the OS
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)

Calling the client

from dask.distributed import Client, as_completed

cluster = "tcp://localhost:18786"
client = Client(cluster)
a = process('path_to_file')
f = client.compute([a])

try:
    for done_work in as_completed(f, with_results=True, raise_errors=True):
        pass
except Exception as e:
    client.cancel(f)
finally:
    del f

Before running any tasks, the worker's memory usage is 125.12 MiB.

I launched the task for the first function (reading a big file without pandas) and memory usage grew to 510.70 MiB.

After that I ran the first function on a small file, and memory stayed almost the same: 511.30 MiB.
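For reference, the worker's RSS can also be read programmatically; here is a small sketch, assuming psutil (which distributed already depends on) and a helper name of my own:

import psutil

def worker_rss_mib():
    # Resident set size of the worker process, in MiB
    return psutil.Process().memory_info().rss / 2**20

# Returns a dict {worker_address: RSS in MiB}
client.run(worker_rss_mib)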

I understand this is normal Linux memory management, or what this article describes as "high water" behavior.

After that I called the trim method and memory returned to roughly its initial level: 127.67 MiB.
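For reference, the trim function can be shipped to the workers with Client.run, the same pattern shown in the Dask worker-memory documentation (a minimal sketch reusing trim_memory and client from above):

# Ship trim_memory to every worker and execute it there;
# returns {worker_address: return value of malloc_trim}
client.run(trim_memory)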

Then I ran the pandas function a couple of times and memory usage grew to 224.07 MiB.

Calling the trim method this time did not bring it back to the initial value; the worker still uses 193.59 MiB.

I tried memory profiling, but I could not find anything that points to the source of this leak.
Or maybe I am doing something wrong and not using Dask correctly?

My setup:
dask scheduler, started with these parameters:

scheduler --host localhost --port 18786 --dashboard --dashboard-address 18787 --protocol tcp

dask worker, started with:

worker --host localhost --nthreads 1 --nworkers 1 --name worker-one-sample --memory-limit 1024MiB --dashboard --dashboard-address 18788 --protocol tcp --local-directory /tmp tcp://127.0.0.1:18786

OS Ubuntu 22.04, Python 3.9.16, dask and distributed 2023.9.2
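For anyone who wants to reproduce this from Python instead of the CLI, a roughly equivalent local setup might look like this (a sketch; ports and the memory limit mirror the commands above):

from dask.distributed import LocalCluster, Client

cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=1,
    memory_limit="1024MiB",
    scheduler_port=18786,
    dashboard_address=":18787",
)
client = Client(cluster)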

Warm regards

Hi @yesdslv, welcome to the Dask Discourse forum!

So I generated a CSV file using this code:

with open('my_big_csv.csv', 'w') as csvfile:
    for i in range(10_000_000):
        csvfile.write("value1,value2,value3,4,5,6\n")

and tried to play with your example.

With that, I cannot even run the process delayed function on an 8 GiB worker: the objects returned by df.iterrows() seem to fill up the worker memory.

So the two pieces of code are not equivalent; one uses much more memory than the other.
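To make that concrete, here is a small sketch (with a made-up frame, not your parquet file) comparing the size of the DataFrame with the size of the materialised iterrows() output:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((100_000, 6)))

# Each iterrows() item is an (index, Series) pair: one new Series per row
rows = list(df.iterrows())

df_mib = df.memory_usage(deep=True).sum() / 2**20
rows_mib = sum(s.memory_usage(deep=True) for _, s in rows) / 2**20
print(f"DataFrame: {df_mib:.1f} MiB, materialised rows: {rows_mib:.1f} MiB")
# The per-object Python overhead of 100_000 Series is not even counted here,
# so the actual RSS growth is larger still.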

Anyway, I'm not sure all of this is worth it. Python memory management is really complex; do you really need this level of precision?

Hi,
Thank you, glad to be here.

Please decrease the number of rows in the file:

with open('my_big_csv.csv', 'w') as csvfile:
    for i in range(80_000):
        csvfile.write("value1,value2,value3,4,5,6\n")

Yes, they are not equivalent. My point is to demonstrate a problem in the interaction between Dask and pandas: pandas DataFrame and Series references are held in the worker after the task completes.
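One rough way to check whether DataFrame/Series objects really are still alive in the worker is a sketch like this (run from the client; the function name is mine):

def count_pandas_objects():
    # Count live DataFrame/Series instances in the worker process
    import gc
    import pandas as pd

    gc.collect()
    counts = {"DataFrame": 0, "Series": 0}
    for obj in gc.get_objects():
        if isinstance(obj, pd.DataFrame):
            counts["DataFrame"] += 1
        elif isinstance(obj, pd.Series):
            counts["Series"] += 1
    return counts

# Returns {worker_address: counts} once the tasks are done
client.run(count_pandas_objects)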

I tried that, but memory usage is still gradually increasing.

My colleague has more experience working with Dask. She said it is better not to use pandas within delayed functions, so it is what it is.

When you first run something on a worker, all the necessary global state - namely, module imports - will be loaded as a one-off and not released when the tasks end.
This is a deliberate design feature of any long-running Python service - in the extremely likely event that you run something else afterwards that requires the same modules, they will already be in memory.
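As a rough illustration of the import cost alone, here is a sketch you can run in a fresh interpreter (exact numbers will vary with the environment):

import os
import psutil

def rss_mib():
    # Resident set size of the current process, in MiB
    return psutil.Process(os.getpid()).memory_info().rss / 2**20

print(f"before import: {rss_mib():.1f} MiB")
import pandas  # importing pandas alone typically costs tens of MiB
print(f"after importing pandas: {rss_mib():.1f} MiB")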

This isn’t a leak. A leak is when you run your task thousands of times, release the output after every iteration, and you see the baseline memory usage steadily increase.
However, please read: Mild memory leak in dask workers · Issue #8164 · dask/distributed · GitHub
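A rough sketch of that kind of check, reusing the process and trim_memory functions from above (the path and iteration count are placeholders):

from dask.distributed import Client

client = Client("tcp://localhost:18786")

for i in range(1_000):
    fut = client.compute(process("path_to_file"))
    fut.result()   # wait for the task to finish
    del fut        # release the future so the worker can drop the result
    if i % 100 == 0:
        # trim periodically so glibc caching does not hide the baseline trend
        print(i, client.run(trim_memory))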