When using the Dask DataFrame where clause I get a “distributed.worker_memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS” warning. This happens until the system runs out of memory and swap. Is there a workaround for this, or am I doing something wrong? The file I’m reading can be found at Box. You have to read it in with Pandas and save it as a Parquet file for Dask to read it.
from dask import dataframe as dd
from dask.distributed import Client
from pathlib import Path

file_path = Path('../cannabis_data_science/wa/nov_2021')

client = Client(n_workers=2, threads_per_worker=2, memory_limit='15GB')
client

sale_items_df = dd.read_parquet(path=file_path / 'SaleItems_1.parquet', blocksize='100MB').reset_index()

# this line triggers the warning
x = sale_items_df.description.where(sale_items_df.description.isna(), sale_items_df.name).compute()
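For reference, the Pandas-to-Parquet conversion step I mentioned above was roughly this (a sketch; the input filename and format are assumptions, since the Box download may be named differently):

import pandas as pd

# Hypothetical input filename; adjust to whatever the Box file is actually called.
# Writing Parquet requires pyarrow or fastparquet to be installed.
raw_df = pd.read_csv(file_path / 'SaleItems_1.csv', low_memory=False)
raw_df.to_parquet(file_path / 'SaleItems_1.parquet')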
@ufosoftware Welcome!
I think read_parquet might be causing the MemoryError. Dask DataFrame operates lazily, which means the read_parquet is executed only when you call .compute(). I’d suggest using .persist() to make sure read_parquet works:
sale_items_df = dd.read_parquet(path=file_path / 'SaleItems_1.parquet', blocksize='100MB').reset_index()
sale_items_df = sale_items_df.persist()  # persist returns a new collection, so reassign it
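If you want to confirm when the load has actually finished, progress and wait from dask.distributed can be pointed at the persisted collection (an extra sketch, not strictly required):

from dask.distributed import progress, wait

progress(sale_items_df)  # optional: show a progress bar while the partitions load
wait(sale_items_df)      # block until every partition is materialized in worker memory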
If this does raise a memory error, you can check out some optimization techniques given here: Dask Dataframe and Parquet — Dask documentation, like using the pyarrow engine, using appropriate partition sizes, and setting ignore_metadata_file=True.
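Concretely, those options would look something like this (a sketch; the blocksize value is only an example):

sale_items_df = dd.read_parquet(
    path=file_path / 'SaleItems_1.parquet',
    engine='pyarrow',           # explicitly pick the pyarrow engine
    ignore_metadata_file=True,  # skip the global _metadata file
    blocksize='100MB',          # keep partitions at a manageable size
).reset_index()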
This blog might also be helpful: Tackling unmanaged memory with Dask | Coiled
Thank you for your response. I tried all the suggestions; they delay the memory error and crash, but both still occur.
@ufosoftware did you manage to make it work? I am facing this issue as well. I am reading a few JSON files from an S3 bucket and hitting these issues on a sufficiently large machine (8 cores, 32 GB). Even a server upgrade from 16 GB to 32 GB didn’t help.
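For reference, my reads look roughly like this (a sketch; the bucket and paths are placeholders, and blocksize only applies to line-delimited JSON):

import dask.dataframe as dd

df = dd.read_json(
    's3://my-bucket/some/prefix/*.json',  # placeholder S3 path
    lines=True,                           # line-delimited JSON, required for blocksize
    blocksize='64MB',                     # split large files into partitions
    storage_options={'anon': False},      # use configured AWS credentials
)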
My issue had to do with using Parquet files: Dask was not able to make accurate memory decisions when reading them. I resolved the issue by converting the files to CSV.
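Roughly, the workaround was (a sketch; the filenames assume the same layout as my original example):

import pandas as pd
from dask import dataframe as dd

# One-off conversion back to CSV (reading Parquet requires pyarrow or fastparquet).
pd.read_parquet(file_path / 'SaleItems_1.parquet').to_csv(
    file_path / 'SaleItems_1.csv', index=False
)

# dd.read_csv supports blocksize, so partition sizes are explicit.
sale_items_df = dd.read_csv(file_path / 'SaleItems_1.csv', blocksize='100MB')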
Hi there!
I’m quite surprised by this conclusion; falling back to CSV files doesn’t seem like the right answer to me. Maybe it works better simply because CSV reads are slower. There has been quite a lot of improvement in memory management on the Dask side since then. @ufosoftware did you try with newer versions?
@matrixbegins maybe you should open a new post to discuss your problem?
@guillaumeeb
My problem has become quite a story now. Let me open a new ticket; I request you not to be overwhelmed by it.
My understanding is that Dask is able to make better memory-management decisions when reading CSV files. I don’t remember the exact reason, but it didn’t have anything to do with speed.
I was using the latest version at the time and I received a lot of help from a Dask expert.
It would be nice if you or someone else could elaborate on this, but I understand this is quite old for you.