Full garbage collection warnings during df.shape.compute()

Hi,

I ran the following code on a 16-core, 256 GB machine:

from dask import dataframe as dd

n_cores = 16  # the machine has 16 cores

df = dd.read_parquet("dataset_of_17GB/parquet/")  # there are 3k+ files under the directory

df = df.repartition(npartitions=n_cores * 3).persist()

df.shape[0].compute()  # row count


I got page after page of warnings like these, and the computation looks like it will run forever:

2025-10-10 20:08:55,277 - distributed.utils_perf - WARNING - full garbage collections took 15% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,304 - distributed.utils_perf - WARNING - full garbage collections took 23% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,368 - distributed.utils_perf - WARNING - full garbage collections took 20% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,378 - distributed.utils_perf - WARNING - full garbage collections took 22% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,392 - distributed.utils_perf - WARNING - full garbage collections took 18% CPU time recently (threshold: 10%)
2025-10-10 20:08:55,483 - distributed.utils_perf - WARNING - full garbage collections took 23% CPU time recently (threshold: 10%)

Why does such a simple command use so much memory and run so slowly? How can I configure the LocalCluster to optimize it?

Hi @tongxin.wen,

It's a bit hard to answer without the real dataset, or a generated one that triggers the same behavior.

Some thoughts:

  • If your Parquet files don’t have the right metadata, df.shape might need to read all the partitions into memory in order to compute their lengths (see the sketch after this list for a metadata-only check).
  • Do you know how many partitions you end up with? A lot of Python objects may be loaded and quickly discarded during this operation; that does not necessarily mean a lot of memory is being used.
  • Did you look at the Dashboard while performing the operation?
  • How is your LocalCluster currently configured?
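
For the first point, you can quickly check whether the row count is available from the metadata alone, outside of Dask. This is only a minimal sketch, assuming pyarrow can open the directory (path taken from your post); it reads just the Parquet footers, not the column data:

# Count rows from the Parquet footers only; no column data is loaded.
import pyarrow.dataset as ds

dataset = ds.dataset("dataset_of_17GB/parquet/", format="parquet")
print(dataset.count_rows())  # sums the per-file row counts from the metadata

If this returns quickly, the cost in your Dask version is probably coming from materializing the partitions themselves.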

Hi @guillaumeeb, yes, I understand. But thinking about it in a very basic way: the computation was much slower than a simple solution of a few lines of code that uses a multiprocessing pool to read, count, and accumulate over the 3000+ Parquet files, right? Such a simple solution would certainly not use this much memory.
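
The baseline I have in mind is roughly something like this sketch (the glob pattern is assumed, and counting via the footer is just one way to do the per-file count):

# Sketch of the multiprocessing baseline: count rows per file, then sum.
import glob
from multiprocessing import Pool

import pyarrow.parquet as pq

def count_rows(path):
    # Reads only the Parquet footer, not the column data.
    return pq.ParquetFile(path).metadata.num_rows

if __name__ == "__main__":
    files = glob.glob("dataset_of_17GB/parquet/*.parquet")
    with Pool(16) as pool:  # one process per core
        total = sum(pool.map(count_rows, files))
    print(total)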

One potential problem is that there is indeed no metadata file in the directory; that may be the key. I will check it later.

The LocalCluster was created with just the default settings. Which factor determines the final number of partitions, the file count?

I didn't look at the dashboard. Can you tell me which part I should focus on, or give me more clues for debugging this kind of issue?
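
For reference, the cluster setup really is just the defaults, something like this sketch (no explicit n_workers, threads_per_worker, or memory_limit):

# LocalCluster with default settings; printing the dashboard link so it
# can be opened in a browser while the computation runs.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)
print(client.dashboard_link)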

After some trials, I found it was caused by the persist operation in my code.

After I removed the persist, the GC warnings became much less frequent than before.

Oh, I didn’t see the repartition and persist operations earlier…

Repartition is also expensive; be careful about whether you really need it. And yes, persist will try to fit all your data into memory, which can be expensive.
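
If the row count is all you need, a lighter variant (just a sketch) skips both repartition and persist, so partitions are streamed through memory instead of being kept there:

# No repartition, no persist: compute only the length.
from dask import dataframe as dd

df = dd.read_parquet("dataset_of_17GB/parquet/")
n_rows = len(df)  # equivalent to df.shape[0].compute()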