Large graph warning

Hi there,

I am getting the following warning after I read in my data and execute any basic operation on it:

UserWarning: Sending large graph of size 336.99 MiB.
This may cause some slowdown.
Consider loading the data with Dask directly
or using futures or delayed objects to embed the data into the graph without repetition.
See also the Dask Best Practices documentation for more information.

This warning appears even after I simply run

ddf = dd.read_parquet("path/to/dataframe/data.parquet", filesystem="arrow")
ddf.count().compute()

The only thing that I can think of potentially being a problem is that my parquet dataset is partitioned very finely. But I get the warning even when I specify the blocksize such that I only have ~200 partitions.
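For reference, the ~200-partition read looks roughly like this (the exact blocksize value here is a placeholder):

ddf = dd.read_parquet(
    "path/to/dataframe/data.parquet",
    filesystem="arrow",
    blocksize="256MiB",  # placeholder value; larger blocks mean fewer partitions
)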

Thanks so much in advance!

Hi @julienberman, welcome to the Dask community!

Could you be more precise about this partitioning? How many files do you have? You do seem to have a large graph size, but considering your code this should only happen if you have far too many partitions. How many partitions does your DataFrame have in the beginning?

In the beginning, my dataframe has 50,000 partitions. But when I read the data such that the dataframe has only ~200 partitions, I still get the warning, albeit with a slightly different size reported. In total, my dataframe is 14 GB. I am working on a cluster with 64 cores and 5 GiB of RAM per core.

Even 50,000 partitions shouldn't generate such a big graph size. Is the code example above the only thing you are doing, just the read_parquet call? Which Dask version are you using?

Here is my full code:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from pathlib import Path

def main():
    INDIR = Path("PATH_TO_DATA")
    client = setup_cluster()

    ddf = dd.read_parquet(INDIR / "data.parquet", filesystem="arrow")
  print(ddf.count().compute())

def setup_cluster():
    cluster = LocalCluster(
        n_workers=64,
        threads_per_worker=1,
        memory_limit='5 GiB'
    )
    return Client(cluster)

if __name__ == "__main__":
    main()

My version is here:

(virtual-environment) [username]@[username]-MacBook-Pro home-directory % dask --version
dask, version 2025.5.1

Still not sure what's going on… the Dask dashboard says there are only ~100,000 tasks (I assume this is 1 read task and 1 compute task per partition, for the 50,000 partitions).
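As a rough way to cross-check the dashboard number, something like this counts the tasks in the materialized graph (a sketch, relying on the standard Dask collection protocol):

# materialize the low-level task graph of the reduction and count its tasks
graph = ddf.count().__dask_graph__()
print(len(graph))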

Sorry for the late reply. I'm not sure either, but I still find it odd to get such a large graph for 100k tasks.

What would help would be a complete MVCE, including for instance a part where you generate fake data with the same number of partitions.
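Something along these lines, for instance (the file name, date range, and partition_freq are placeholders; tune partition_freq so the partition count matches yours):

import dask
import dask.dataframe as dd

# Build a synthetic timeseries DataFrame; a small partition_freq yields
# many partitions (hourly partitions over two years is roughly 17,500).
fake = dask.datasets.timeseries(
    start="2000-01-01",
    end="2002-01-01",
    freq="1s",
    partition_freq="1h",
)
fake.to_parquet("fake_data.parquet")

ddf = dd.read_parquet("fake_data.parquet", filesystem="arrow")
print(ddf.npartitions)
print(ddf.count().compute())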