Whole-Dataset Computation with Dask Delayed in Subset Function, Leading to Kernel Crashes

Problem Description:

I’m working on a function using dask.delayed to process a dataset by creating smaller subsets containing latitude, longitude, and a given timestamp. The function works as expected without dask.delayed, returning the correct subset of data. However, when I introduce dask.delayed, unexpected behavior occurs.

Unexpected Behavior:

  1. Excessive Computation: Instead of processing just the specified subset, it appears that Dask tries to compute the entire dataset.
  2. Kernel Crashes: When I attempt to create a visualization or inspect the computed data, the kernel crashes, making it impossible to gather further details.
  3. Premature Parallelism: Upon calling the compute() method, all threads seem to start processing immediately, even though the intention is to compute only a small part of the dataset (specifically, the values for a single grid cell at the first timestep).

What I Tried:

  • Without dask.delayed, I verified that the function correctly retrieves the values for latitude=0, longitude=0, and timestep=0.
  • When testing with dask.delayed, I expected the same result, but the output is different and causes the issues mentioned above.

Goal:

In the first step, I want to ensure that the function using dask.delayed produces the same results as the non-delayed version for a single grid cell and a single timestep. Once confirmed, I plan to scale this approach to process the entire grid and multiple timesteps.

Request:

Has anyone experienced similar issues with dask.delayed, or can anyone suggest why Dask might be trying to compute the entire dataset instead of just the specified subset? Any insights or debugging tips would be greatly appreciated.


The Code used is:

import dask
import xarray as xr

# load the data
xr_data = xr.open_dataset('data.grb', engine='cfgrib', chunks={'step': 1, 'number': 1})

# create small parts of the data to perform some analysis
@dask.delayed
def load_grid_data(data, lat, lon, timestep):
    grid_data = data.sel(latitude=lat, longitude=lon, step=timestep)
    return (grid_data, lat, lon, timestep)
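For reference, here is roughly how I invoke the function for a single cell at the first timestep. This is a minimal, self-contained sketch: the synthetic dataset and the variable name `t2m` are stand-ins for the real GRIB data, not the actual file.

```python
import dask
import numpy as np
import xarray as xr

# Synthetic stand-in for the real GRIB dataset (dimension names
# latitude/longitude/step match the real data; 't2m' is assumed).
ds = xr.Dataset(
    {"t2m": (("latitude", "longitude", "step"), np.arange(8.0).reshape(2, 2, 2))},
    coords={"latitude": [0, 1], "longitude": [0, 1], "step": [0, 1]},
)

@dask.delayed
def load_grid_data(data, lat, lon, timestep):
    grid_data = data.sel(latitude=lat, longitude=lon, step=timestep)
    return (grid_data, lat, lon, timestep)

# Build one delayed task and compute only that task:
task = load_grid_data(ds, 0, 0, 0)
grid_data, lat, lon, timestep = task.compute()
print(float(grid_data["t2m"]))  # value at latitude=0, longitude=0, step=0
```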

Hi @Eis-ba-er, welcome to Dask community!

Could you also share the code you use after declaring the function? For launching one or several computations?

I see no reason a Delayed function would produce a different result. But here you are also mixing Delayed and Dask Array through Xarray, which might be the cause. How big is your grib file?
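To illustrate what I mean by mixing the two: when you open the dataset with `chunks=...`, the variables are already lazy Dask arrays, so a plain `.sel(...)` is itself lazy and there is no need for `dask.delayed` at all. A minimal sketch with a synthetic chunked dataset (the variable name is an assumption):

```python
import numpy as np
import xarray as xr

# Synthetic chunked dataset standing in for the GRIB file.
ds = xr.Dataset(
    {"t2m": (("latitude", "longitude", "step"), np.arange(27.0).reshape(3, 3, 3))},
    coords={"latitude": [0, 1, 2], "longitude": [0, 1, 2], "step": [0, 1, 2]},
).chunk({"step": 1})

# The data are already lazy, so no dask.delayed wrapper is needed:
subset = ds["t2m"].sel(latitude=0, longitude=0, step=0)
print(type(subset.data))          # still a dask array until computed
value = float(subset.compute())   # only the relevant chunk is loaded
```

Wrapping the whole chunked dataset inside a `dask.delayed` function instead embeds the entire object in the task, which is one way you can end up pulling far more data than the subset you asked for.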

I’m aware that a solution might not exist due to an issue with the library, but I wanted to share the problem I’m facing for better understanding. I’ve created a GitHub issue here to reproduce the problem.

In short, I’m trying to perform an analysis on ensemble members for each grid cell. The goal is to iterate over the entire grid, calculate a value for each cell, store the results in an array, and eventually save everything to a NetCDF4 file. The dataset I’m working with is quite large—at least 140 GB in total—and there are several similar datasets.

I’m using xarray to read the data in chunks, intending to create a parallelized analysis for each grid cell. However, I’m encountering issues with the combination of dask.delayed and xarray. Additionally, my first attempt to use concurrent.futures for parallelization seemed slower than expected.

Could you explain how this issue arises when using dask.delayed with xarray and how to handle parallelization for analyzing large datasets like this more efficiently?

So far, seeking advice from AI hasn’t improved the code or resolved the problem.

I noticed a similar concept mentioned in the Dask documentation under Delayed Best Practices, specifically at the section “Break up computations into many pieces.”

If required, I can also upload further example code.

As answered in the issue you opened: generally, you don’t want to mix Delayed and Xarray. Xarray handles the distribution of work and the use of Dask for you.
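For the per-cell ensemble analysis you describe, the Xarray-native pattern is usually to express the computation as an operation over the ensemble dimension and let Dask parallelize it chunk by chunk. A minimal sketch, assuming a `number` ensemble dimension and a mean as the per-cell statistic (both assumptions; substitute your real analysis):

```python
import numpy as np
import xarray as xr

# Illustrative stand-in for the ensemble dataset ('number' is the
# ensemble dimension, as produced by cfgrib; 't2m' is assumed).
ds = xr.Dataset(
    {"t2m": (("number", "latitude", "longitude"), np.arange(12.0).reshape(3, 2, 2))},
    coords={"number": [0, 1, 2], "latitude": [0, 1], "longitude": [0, 1]},
).chunk({"number": 1})

# One lazy reduction over the ensemble dimension covers every grid
# cell at once; Dask parallelizes it across chunks for you.
result = ds["t2m"].mean(dim="number")

# Nothing is computed until you ask for it (or write to disk):
computed = result.compute()
# result.to_dataset(name="t2m_mean").to_netcdf("result.nc")
print(float(computed.sel(latitude=0, longitude=0)))
```

For analyses that cannot be written as built-in reductions, `xarray.apply_ufunc` or `Dataset.map_blocks` serve the same role without hand-rolled `dask.delayed` loops.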