Large graph warning

Hi there,

I am getting the following warning after I read in my data and execute any basic operation on it:

UserWarning: Sending large graph of size 336.99 MiB.
This may cause some slowdown.
Consider loading the data with Dask directly
or using futures or delayed objects to embed the data into the graph without repetition.
See also the Dask Best Practices documentation for more information.

This warning appears even after I simply run

ddf = dd.read_parquet("path/to/dataframe/data.parquet", filesystem="arrow")
ddf.count().compute()

The only thing that I can think of potentially being a problem is that my parquet dataset is partitioned very finely. But I get the warning even when I specify the blocksize such that I only have ~200 partitions.
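For reference, the ~200-partition read looks roughly like this (the exact blocksize value here is a placeholder):

ddf = dd.read_parquet(
    "path/to/dataframe/data.parquet",
    filesystem="arrow",
    blocksize="256MiB",  # placeholder value; larger blocks mean fewer partitions
)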

Thanks so much in advance!

Hi @julienberman, welcome to the Dask community!

Could you be more precise about this partitioning? How many files do you have? You do seem to have a large graph size, but considering your code this should only happen if you have far too many partitions. How many partitions does your DataFrame have in the beginning?

In the beginning, my dataframe has 50,000 partitions. But when I read the data such that the dataframe has only ~200 partitions, I still get the warning, albeit with a slightly different size reported. In total, my dataframe is 14 GB. I am working on a cluster with 64 cores and 5 GiB of RAM per core.

Even 50,000 partitions shouldn't generate such a big graph size. Is the code example above the only thing you are doing, just the read_parquet call? Which Dask version are you using?

Here is my full code:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from pathlib import Path

def main():
    INDIR = Path("PATH_TO_DATA")
    client = setup_cluster()

    ddf = dd.read_parquet(INDIR / "data.parquet", filesystem="arrow")
  print(ddf.count().compute())

def setup_cluster():
    cluster = LocalCluster(
        n_workers=64,
        threads_per_worker=1,
        memory_limit='5 GiB'
    )
    return Client(cluster)

if __name__ == "__main__":
    main()

My version is here:

(virtual-environment) [username]@[username]-MacBook-Pro home-directory % dask --version
dask, version 2025.5.1

Still not sure what's going on… the Dask dashboard says there are only ~100,000 tasks (I assume this is 1 read task and 1 compute task per partition, for the 50,000 partitions).
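As a rough way to cross-check the dashboard number, something like this counts the tasks in the materialized graph (a sketch, relying on the standard Dask collection protocol):

# materialize the low-level task graph of the reduction and count its tasks
graph = ddf.count().__dask_graph__()
print(len(graph))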

Sorry for the late reply. I'm not sure either, but I still find it odd to get such a large graph for 100k tasks.

What would help would be a complete MVCE, including for instance a part where you generate fake data with the same number of partitions.
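Something along these lines, for instance (the file name, date range, and partition_freq are placeholders; tune partition_freq so the partition count matches yours):

import dask
import dask.dataframe as dd

# Build a synthetic timeseries DataFrame; a small partition_freq yields
# many partitions (hourly partitions over two years is roughly 17,500).
fake = dask.datasets.timeseries(
    start="2000-01-01",
    end="2002-01-01",
    freq="1s",
    partition_freq="1h",
)
fake.to_parquet("fake_data.parquet")

ddf = dd.read_parquet("fake_data.parquet", filesystem="arrow")
print(ddf.npartitions)
print(ddf.count().compute())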