Slow Performance with xarray+dask for Large Number of Small NetCDF Files

Hi everyone,

I’m currently working on processing a large number of small NetCDF files (each around 2MB, totaling roughly 120,000 files) using xarray and dask, but I’ve run into significant performance bottlenecks. I hope to get some insights or solutions from the community. Here are my specific issues:

1. Extreme Slowness When Opening Datasets & Non-linear Time Increase in dask.compute

First, using xarray.open_mfdataset to load these 120k files is extremely slow—the process takes far longer than I expected. To troubleshoot, I tried a workaround: writing custom dask.delayed functions to load individual files. However, I encountered two contrasting results:

  • When I passed all 120,000 delayed objects to dask.compute() at once, then used xr.concat() to combine them, the process was still painfully slow. Worse, the Dask Dashboard showed no sign of the computation starting for a long time.

  • When I split the 120,000 delayed functions into smaller batches, called dask.compute() on each batch sequentially, and then concatenated the results, the delay became relatively manageable (a rough sketch of both variants is shown after this list).
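For reference, here is a simplified sketch of the workaround I mean. The file pattern, batch size, and the "time" concatenation dimension are placeholders for my real setup:

```python
import glob

import dask
import xarray as xr


@dask.delayed
def load_one(path):
    # Open one small NetCDF file and load it fully into memory so the
    # file handle can be closed right away.
    with xr.open_dataset(path) as ds:
        return ds.load()


paths = sorted(glob.glob("data/*.nc"))  # ~120k files of ~2MB each

# Variant 1: all 120k delayed objects in a single dask.compute() call.
# datasets = dask.compute(*[load_one(p) for p in paths])
# combined = xr.concat(datasets, dim="time")

# Variant 2: compute in batches, which was much more responsive for me.
batch_size = 1000
loaded = []
for i in range(0, len(paths), batch_size):
    batch = [load_one(p) for p in paths[i:i + batch_size]]
    loaded.extend(dask.compute(*batch))
combined = xr.concat(loaded, dim="time")
```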

This leads me to my first question: Why is open_mfdataset so slow when dealing with a huge number of small files? Additionally, why doesn’t the time cost of dask.compute() increase linearly with the number of delayed objects passed to it? It seems like the overhead grows much faster than the number of tasks.

2. Slow Basic Computations (min/max/mean) & Task Graph Bloat

Even after I managed to open the dataset with some workarounds, running simple computations like min(), max(), or mean() is still unusually slow. Again, the Dask Dashboard shows no sign of the computation starting for a long time.

I also frequently get warnings about the task graph being too large, which I suspect is related to the slowness. This brings me to my second question: What is the best practice for processing such a large number of small files? How can I elegantly construct a complete dataset so that top-level functions like mean() can handle the data efficiently, regardless of the total volume? Also, to avoid task graph bloat, is there a way to simplify the graph, for example by merging multiple task nodes into one?

Any advice, experience sharing, or references to relevant resources would be greatly appreciated! Thanks in advance.

Hi @longkaihao, welcome to the Dask community!

This is because Xarray has to open each one of the files, read its metadata, and build an index with all the files’ dimensions and coordinates. There is an argument (I think parallel=True) to make this happen on the Dask worker side, but it’s not magical either.
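Something along these lines should already help a bit. The file pattern and concatenation dimension below are placeholders you’d need to adapt, and the combine-related options just skip some expensive compatibility checks:

```python
import xarray as xr

# parallel=True opens the files on Dask workers instead of serially in the
# client process; the combine options below avoid costly comparisons of
# coordinates and data variables across 120k files.
ds = xr.open_mfdataset(
    "data/*.nc",          # placeholder pattern
    parallel=True,
    combine="nested",
    concat_dim="time",    # adapt to your actual concatenation dimension
    coords="minimal",
    data_vars="minimal",
    compat="override",
)
```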

I’m not sure about that, but scheduling tasks is a complex process that depends on the computation. The more burden you put on the Dask Scheduler, the longer it will take to actually start a computation. This might not be O(n).

Plus, I’m not sure what you mean by calling dask.compute(): do you want to load all the data? Maybe you could add a code snippet.

So what is slow: waiting for the computation to actually start and show up on the Dashboard, or the computation itself once it has started?

The best way here would be to convert or rechunk your Dataset to avoid having such small files. You might want to look at tools like rechunker for that, and try to use Zarr with chunk sizes appropriate to your Dataset dimensions.
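Just as a rough sketch of the idea (chunk size, dimension name, and paths are assumptions to adapt; for datasets that don’t fit in memory, rechunker does this more robustly):

```python
import xarray as xr

# One-off conversion: build the combined dataset once, rechunk it along
# the concatenation dimension, and write it to a single Zarr store.
combined = xr.open_mfdataset(
    "data/*.nc", parallel=True, combine="nested", concat_dim="time"
)
combined.chunk({"time": 1000}).to_zarr("dataset.zarr", mode="w")

# Later analyses open the Zarr store directly, with far fewer tasks.
ds = xr.open_zarr("dataset.zarr")
print(ds.mean(dim="time").compute())
```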

If this is not an option, you can look at kerchunk, which will help you build a metadata file that makes it much faster to open your files. However, I don’t think it will solve the huge graph problem related to the number of files you have.
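Roughly like this, though I’m writing it from memory and haven’t run it on your data; the paths and the "time" concat dimension are assumptions:

```python
import glob

import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Build one set of references per NetCDF file (do this once, it can be slow).
paths = sorted(glob.glob("data/*.nc"))
single_refs = []
for p in paths:
    with open(p, "rb") as f:
        single_refs.append(SingleHdf5ToZarr(f, p).translate())

# Combine them into a single set of references describing the whole dataset.
# In practice you would dump this dict to a JSON file once and reuse it.
combined_refs = MultiZarrToZarr(single_refs, concat_dims=["time"]).translate()

# Opening is then a single metadata read instead of 120k file opens.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {"fo": combined_refs, "remote_protocol": "file"},
    },
)
```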

For this you would need an open_mfdataset that reads several files per task, but I don’t think it exists. I guess you could build some equivalent on your own, something like the sketch below…
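Something in this spirit, keeping in mind that the glob pattern, batch size, and "time" dimension are placeholders:

```python
import glob

import dask
import xarray as xr


@dask.delayed
def load_batch(paths):
    # One task opens and concatenates a whole batch of files, so the graph
    # has roughly n_files / batch_size tasks instead of one per file.
    datasets = [xr.open_dataset(p).load() for p in paths]
    return xr.concat(datasets, dim="time")


paths = sorted(glob.glob("data/*.nc"))
batch_size = 500
batches = [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

partial = dask.compute(*[load_batch(b) for b in batches])
combined = xr.concat(partial, dim="time")
```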

Anyway, I still think the best way is to copy your dataset into another format with bigger chunks; this will pay off for any further computations you’d like to apply. There are also people with much more experience with this kind of data on the Pangeo forum.

Hi @guillaumeeb , I really appreciate your patient reply.

With your help, I think I now have a better understanding of the best practices for this situation.

It seems that Dask is not a silver bullet, but I had always assumed I could use it to finish my work in one shot (such as opening 120k files, calculating complex indices, and outputting a final result directly). This may not be practical, and finding a workaround may waste more time than using a simple loop. I should reduce the complexity of my tasks, or try to avoid handling so many tiny files.

Also, I will try those useful tools you recommended, thank you again. :grinning_face:
