Hi everyone,
I’m currently working on processing a large number of small NetCDF files (each around 2MB, totaling roughly 120,000 files) using xarray and dask, but I’ve run into significant performance bottlenecks. I hope to get some insights or solutions from the community. Here are my specific issues:
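For context, this is roughly what the initial load looks like; the glob pattern and engine options are illustrative placeholders, not my exact code:

```python
import xarray as xr

# ~120,000 NetCDF files, ~2 MB each; the path pattern is a placeholder.
ds = xr.open_mfdataset(
    "data/*.nc",
    combine="by_coords",
    parallel=True,  # open files in parallel via dask.delayed
)
```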
1. Extreme Slowness When Opening Datasets & Nonlinear Time Increase in `dask.compute()`
First, using `xarray.open_mfdataset` to load these 120k files is extremely slow; the process takes far longer than I expected. To troubleshoot, I tried a workaround: writing custom `dask.delayed` functions to load individual files. However, I got two contrasting results (a sketch of both approaches follows this list):
- When I passed all 120,000 delayed objects to `dask.compute()` at once and then used `xr.concat()` to combine the results, the process was still painfully slow. Worse, the Dask Dashboard showed no sign of the computation starting for a long time.
- When I split the 120,000 delayed functions into smaller batches, called `dask.compute()` on each batch sequentially, and then concatenated the results, the delay became relatively manageable.
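For reference, here is a minimal sketch of the two approaches. The loader function, file pattern, batch size, and concat dimension ("time") are simplified placeholders for my actual code:

```python
import glob

import dask
import xarray as xr


@dask.delayed
def load_one_file(path):
    # Eagerly load one small NetCDF file into memory.
    with xr.open_dataset(path) as ds:
        return ds.load()


paths = sorted(glob.glob("data/*.nc"))  # ~120,000 files
delayed_datasets = [load_one_file(p) for p in paths]

# Approach 1: hand everything to dask.compute() at once.
# This is the variant that stalls for a long time before the
# dashboard shows any activity.
# datasets = dask.compute(*delayed_datasets)
# combined = xr.concat(datasets, dim="time")

# Approach 2: compute in sequential batches, then concatenate.
# This is noticeably more manageable for me.
batch_size = 1000  # placeholder; I tuned this by hand
results = []
for i in range(0, len(delayed_datasets), batch_size):
    batch = delayed_datasets[i : i + batch_size]
    results.extend(dask.compute(*batch))
combined = xr.concat(results, dim="time")
```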
This leads me to my first question: Why is `open_mfdataset` so slow when dealing with a huge number of small files? And why doesn't the time cost of `dask.compute()` grow linearly with the number of delayed objects passed to it? The overhead seems to grow much faster than the number of tasks.
2. Slow Basic Computations (min/max/mean) & Task Graph Bloat
Even after I managed to open the dataset with some workarounds, running simple computations like `min()`, `max()`, or `mean()` is still unusually slow. Again, the Dask Dashboard shows no sign of the computation starting for a long time.
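Concretely, even a reduction as simple as the one below stalls before anything appears on the dashboard (the variable name is a placeholder):

```python
# "temperature" stands in for one of my data variables.
result = combined["temperature"].mean(dim="time").compute()
print(result)
```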
I also frequently get warnings about the task graph being too large, which I suspect is related to the slowness. This brings me to my second question: What is the best practice for processing such a large number of small files? How can I elegantly construct a complete dataset so that top-level functions like `mean()` can handle the data efficiently, regardless of the total volume? Also, to avoid task graph bloat, is there a way to simplify the graph, for example by merging multiple task nodes into one?
Any advice, experience sharing, or references to relevant resources would be greatly appreciated! Thanks in advance.