Full garbage collection warnings during df.shape.compute()

Hi,

I ran the following code on a 16-core, 256 GB machine:

from dask import dataframe as dd

n_cores = 16  # the machine has 16 cores

df = dd.read_parquet("dataset_of_17GB/parquet/")  # there are 3k+ files under the directory

df = df.repartition(npartitions=n_cores * 3).persist()

df.shape[0].compute()  # row count


I got page after page of warnings like these, and the computation looks like it will run forever:

2025-10-10 20:08:55,277 - distributed.utils_perf - WARNING - full garbage collections took 15% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,304 - distributed.utils_perf - WARNING - full garbage collections took 23% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,368 - distributed.utils_perf - WARNING - full garbage collections took 20% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,378 - distributed.utils_perf - WARNING - full garbage collections took 22% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,392 - distributed.utils_perf - WARNING - full garbage collections took 18% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,483 - distributed.utils_perf - WARNING - full garbage collections took 23% CPU time recently (threshold: 10%)

Why does such a simple command use so much memory and run so slowly? How can I configure the LocalCluster to optimize it?

Hi @tongxin.wen,

It's a bit hard to answer without the real dataset, or a generated one that triggers the same behavior.

Some thoughts:

  • If your Parquet files don’t have the right metadata, df.shape might need to read all the partitions into memory in order to compute their lengths (see the sketch after this list for a metadata-only check).
  • Do you know how many partitions you end up with? A lot of Python objects may be loaded and quickly discarded during this operation; that does not necessarily mean a lot of memory is being used.
  • Did you look at the Dashboard while performing the operation?
  • How is your LocalCluster currently configured?
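
For the first point, you can quickly check whether the row count is available from the metadata alone, outside of Dask. This is only a minimal sketch, assuming pyarrow can open the directory (path taken from your post); it reads just the Parquet footers, not the column data:

# Count rows from the Parquet footers only; no column data is loaded.
import pyarrow.dataset as ds

dataset = ds.dataset("dataset_of_17GB/parquet/", format="parquet")
print(dataset.count_rows())  # sums the per-file row counts from the metadata

If this returns quickly, the cost in your Dask version is probably coming from materializing the partitions themselves.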

Hi @guillaumeeb, yes, I understand. But thinking about it in a very basic way: the computation was much slower than a simple solution of a few lines of code that uses a multiprocessing pool to read, count, and accumulate over the 3000+ Parquet files, right? Such a simple solution would certainly not use this much memory.
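
The baseline I have in mind is roughly something like this sketch (the glob pattern is assumed, and counting via the footer is just one way to do the per-file count):

# Sketch of the multiprocessing baseline: count rows per file, then sum.
import glob
from multiprocessing import Pool

import pyarrow.parquet as pq

def count_rows(path):
    # Reads only the Parquet footer, not the column data.
    return pq.ParquetFile(path).metadata.num_rows

if __name__ == "__main__":
    files = glob.glob("dataset_of_17GB/parquet/*.parquet")
    with Pool(16) as pool:  # one process per core
        total = sum(pool.map(count_rows, files))
    print(total)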

One potential problem is that there is indeed no metadata file in the directory; that may be the key. I will check it later.

The LocalCluster was created with just the default settings. Which factor determines the final number of partitions, the file count?

I didn't look at the dashboard. Can you tell me which part I should focus on, or give me more clues for debugging this kind of issue?
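
For reference, the cluster setup really is just the defaults, something like this sketch (no explicit n_workers, threads_per_worker, or memory_limit):

# LocalCluster with default settings; printing the dashboard link so it
# can be opened in a browser while the computation runs.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
print(client.dashboard_link)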

After some trials, I found it was caused by the persist operation in my code.

After I removed the persist, the GC warnings became much less frequent than before.

Oh, I didn’t see the repartition and persist operations earlier…

Repartition is also expensive; be careful about whether you really need it. And yes, persist will try to fit all your data into memory, which can be expensive.
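
If the row count is all you need, a lighter variant (just a sketch) skips both repartition and persist, so partitions are streamed through memory instead of being kept there:

# No repartition, no persist: compute only the length.
from dask import dataframe as dd

df = dd.read_parquet("dataset_of_17GB/parquet/")
n_rows = len(df)  # equivalent to df.shape[0].compute()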