Xarray + Dask Memory Overflow When Processing Large NC Datasets - Chunking Not Working

longkaihao · November 21, 2025, 10:08am

Hello everyone, I’m struggling with a persistent memory overflow issue when processing a large number of NetCDF files using Xarray and Dask. I’ve tried adjusting chunking strategies multiple times, but the memory usage keeps growing uncontrollably until the process crashes. I’d really appreciate any insights or solutions from the community!

1. Detailed Dataset & Use Case

Data Structure: I have 2920 NetCDF files. Each file has a consistent dimension structure: (1, lat, lon) (1 time step per file, full spatial coverage).
Spatial Dimensions: Fixed at 1800 (lat) × 3600 (lon) (high-resolution global grid), which means each file contains ~6.48 million data points.
Data Type: Float32 (4 bytes per value) for the main variable (e.g., “temperature” or “precipitation”).
Core Requirement: For each (lat, lon) grid point, I need to compute the Gini coefficient (a measure of data dispersion) using its full time series (all 2920 time steps). My workflow relies on a custom Gini function that takes a 1D time series as input and returns a single value. To ensure the function gets the complete time series for each grid point, I rechunk the dataset to keep the `time` dimension intact after reading.

2. Expected Chunking and Computation Logic

My ideal workflow is straightforward, but it’s failing with memory issues:

Concatenate 2920 files along `time` to get a dataset with shape `(2920, 1800, 3600)`.
Rechunk the dataset to `time: -1` (keep full 2920 steps for Gini calculation) and small spatial chunks (e.g., `lat=20, lon=20`)—so each chunk is `(2920, 20, 20)`.
Apply the custom Gini function to each chunk via `xr.apply_ufunc`, then write results to disk immediately (to avoid keeping data in memory).

In theory, if computation is evaluated with each chunk of a shape (2920, 20, 20) , no very much memory should be used by worker. But memory usage spirals out of control anyway, it seems that the scheduler first read in all data files, and then keep them in the memory before performing chunking, the custom function is invoked at last after chunking finished.

Task graph is just like follows:

3. My questions

Is there a way to make the workflow load only one spatial chunk (corresponding to all 2920 files), release its memory immediately after computing the Gini coefficient, and then move to the next spatial chunk?

Does `open_mfdataset` load entire NetCDF files into memory at once by default, even when chunking is specified? This might explain the persistent memory growth, but I can’t confirm it.

Hope I have described myself clearly. Thanks in advance.

guillaumeeb · November 28, 2025, 8:43am

Hi @longkaihao,

Well, this is a quite common but not that simple computing problem in the Dask/Xarray community. If your dataset is not chunked correctly, this kind of workflow might blow-up memory everytime. It’s not really a Dask problem, more a file format one.

So this comes from NetCDF, and to order to get only one pixel complete in time dimension, you’ll have to load all chunks…

There has been a lot of discussion and efforts in this domain through the Pangeo community. First rechunker, then cubed, and maybe flox or other:

longkaihao · November 29, 2025, 7:15am

Hi @guillaumeeb , thanks for sharing those threads, I would take a further dig.

Topic		Replies	Views
Climate Dataset : xarray & dask repeated fail Dask Array dask-array	7	292	May 13, 2024
Splitting big NetCDF file into hundreds of smaller files Distributed	1	103	November 22, 2024
Slow Performance with xarray+dask for Large Number of Small NetCDF Files Distributed dask-array , delayed	2	171	October 23, 2025
[New User] How to use dask to handle a large file from concatenation of many small files Dask Array	1	361	February 17, 2023
How to efficiently extract patches from a xarray/dask dataset? Distributed xarray , delayed	6	268	July 13, 2024

Xarray + Dask Memory Overflow When Processing Large NC Datasets - Chunking Not Working

1. Detailed Dataset & Use Case

2. Expected Chunking and Computation Logic

3. My questions

Related topics