Hello everyone, I’m struggling with a persistent memory overflow issue when processing a large number of NetCDF files using Xarray and Dask. I’ve tried adjusting chunking strategies multiple times, but the memory usage keeps growing uncontrollably until the process crashes. I’d really appreciate any insights or solutions from the community!
1. Detailed Dataset & Use Case
-
Data Structure: I have 2920 NetCDF files. Each file has a consistent dimension structure:
(1, lat, lon)(1 time step per file, full spatial coverage). -
Spatial Dimensions: Fixed at 1800 (lat) × 3600 (lon) (high-resolution global grid), which means each file contains ~6.48 million data points.
-
Data Type: Float32 (4 bytes per value) for the main variable (e.g., “temperature” or “precipitation”).
-
Core Requirement: For each (lat, lon) grid point, I need to compute the Gini coefficient (a measure of data dispersion) using its full time series (all 2920 time steps). My workflow relies on a custom Gini function that takes a 1D time series as input and returns a single value. To ensure the function gets the complete time series for each grid point, I rechunk the dataset to keep the `time` dimension intact after reading.
2. Expected Chunking and Computation Logic
My ideal workflow is straightforward, but it’s failing with memory issues:
-
Concatenate 2920 files along `time` to get a dataset with shape `(2920, 1800, 3600)`.
-
Rechunk the dataset to `time: -1` (keep full 2920 steps for Gini calculation) and small spatial chunks (e.g., `lat=20, lon=20`)—so each chunk is `(2920, 20, 20)`.
-
Apply the custom Gini function to each chunk via `xr.apply_ufunc`, then write results to disk immediately (to avoid keeping data in memory).
In theory, if computation is evaluated with each chunk of a shape (2920, 20, 20) , no very much memory should be used by worker. But memory usage spirals out of control anyway, it seems that the scheduler first read in all data files, and then keep them in the memory before performing chunking, the custom function is invoked at last after chunking finished.
Task graph is just like follows:
3. My questions
Is there a way to make the workflow load only one spatial chunk (corresponding to all 2920 files), release its memory immediately after computing the Gini coefficient, and then move to the next spatial chunk?
Does `open_mfdataset` load entire NetCDF files into memory at once by default, even when chunking is specified? This might explain the persistent memory growth, but I can’t confirm it.
Hope I have described myself clearly. Thanks in advance.
