Dask Memory Leak Workaround

ufosoftware · May 11, 2022, 12:40pm

0

When using the Dask dataframe where clause I get a “distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS” warning. This happens until the system runs out of memory and swap. Is there a workaround to this or am I doing something wrong. The file I’m reading can be found at Box. You have to read it in with Pandas as save it as a parquet file for Dask to read it.

from dask import dataframe as dd
import dask.array as da
from dask.distributed import Client

from pathlib import Path
import os
file_path = Path('../cannabis_data_science/wa/nov_2021')

client = Client(n_workers=2, threads_per_worker=2, memory_limit = '15GB')
client

sale_items_df = dd.read_parquet(path = file_path / 'SaleItems_1.parquet', blocksize = '100MB').reset_index()

#this causes the warning
x = sale_items_df.description.where(sale_items_df.description.isna(), sale_items_df.name).compute()

pavithraes · May 11, 2022, 6:58pm

@ufosoftware Welcome!

I think read_parquet might be causing the MemoryError – Dask DataFrame operates lazily, which means the read_parquet is executed only when you call .compute(). I’d suggest using .persist() to make sure read_parquet works:

sale_items_df = dd.read_parquet(path = file_path / 'SaleItems_1.parquet', blocksize = '100MB').reset_index()
sale_items_df.persist()

If this does raise a memory error, you can check out some optimization techniques given here: Dask Dataframe and Parquet — Dask documentation, like using the pyarrow engine, using appropriate partition sizes, and setting ignore_metadata_file=True.

This blog might also be helpful: Tackling unmanaged memory with Dask | Coiled : Coiled

ufosoftware · May 17, 2022, 1:05am

Thank you for your response. I tried all the suggestions; they delay the occurrence of the memory error and crash but they still occur.

matrixbegins · March 25, 2023, 5:59pm

@ufosoftware did you manage to make it work? I am facing this issue as well. I am reading few Json files from S3 bucket, and facing these issues on a sufficiently large machines (8 core 32 GB). Even a server upgrade from 16GB to 32GB didn’t work.

ufosoftware · March 25, 2023, 6:43pm

My issue had to do with using parquet files. Dask is not able to make accurate memory decisions when reading parquet files. I resolved the issue by converting the files to csv.

guillaumeeb · March 28, 2023, 9:57am

Hi there!

I’m quite surprise about this conclusion, falling back to CSV file don’t seem like the right answer to me. Maybe it works better because CSV reads are slower. There as been quite a lot of improvement on memory management on Dask side since, @ufosoftware did you try with newer versions?

@matrixbegins maybe you should open a new post to discuss your problem?

matrixbegins · March 28, 2023, 10:50am

@guillaumeeb
My problem have become quite a story now Let me open a new ticket I request you to not be overwhelmed by it.

ufosoftware · March 28, 2023, 3:59pm

My understanding is Dask is able to make better memory management decisions when reading csv files. I don’t remember the exact reason for this but it didn’t have anything to do with speed.
I was using the latest version at the time and I received a lot of help from a Dask expert.

guillaumeeb · March 28, 2023, 5:03pm

It would be nice if somehow you or someone else could elaborate on this, but I understand this is quite old for you.

Topic		Replies	Views
Memory Leakage on single worker on merged DataFrame (after task completion) Dask DataFrame delayed	5	390	October 6, 2023
Memory Leak on Dask Worker Distributed	4	562	July 20, 2022
Run Dask script from command line without warning messages Dask DataFrame	1	169	July 22, 2022
Memory leak in dask cluster Distributed kubernetes , distributed	5	1774	April 13, 2023
Dask not distributing reading of parquet file? Distributed parquet , distributed	1	1680	April 6, 2023

Dask Memory Leak Workaround

Related topics