Problem Description:
I’m working on a function using `dask.delayed` to process a dataset by creating smaller subsets containing latitude, longitude, and a given timestep. The function works as expected without `dask.delayed`, returning the correct subset of data. However, when I introduce `dask.delayed`, unexpected behavior occurs.
Unexpected Behavior:
- Excessive Computation: instead of processing just the specified subset, Dask appears to try to compute the entire dataset (see the task-graph sketch after this list).
- Kernel Crashes: when I attempt to create a visualization or inspect the computed data, the kernel crashes, making it impossible to gather further details.
- Premature Parallelism: upon calling the `compute()` method, all threads seem to start processing immediately, even though the intention is to compute only a small part of the dataset (specifically, the values for a single grid cell at the first timestep).
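To make the first point concrete, the size of the task graph can be inspected before anything runs. Below is a minimal sketch of that check; it assumes the `xr_data` and `load_grid_data` definitions from the code at the end of this post:

# build the delayed object without computing it
delayed_result = load_grid_data(xr_data, lat=0, lon=0, timestep=0)

# count the tasks in the graph; a very large count would suggest that far
# more than the requested subset has been pulled into the computation
graph = delayed_result.__dask_graph__()
print(f"number of tasks in graph: {len(graph)}")

# optionally render the graph instead of computing it (requires graphviz)
# delayed_result.visualize(filename='graph.png')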
What I Tried:
- Without `dask.delayed`, I verified that the function correctly retrieves the values for `latitude=0`, `longitude=0`, and `timestep=0`.
- When testing with `dask.delayed`, I expected the same result, but the output is different and causes the issues mentioned above (see the comparison sketch after this list).
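In case it clarifies what I mean by "the same result", this is roughly the comparison (a minimal sketch; `load_grid_data_eager` is a hypothetical undecorated copy of the function, and `xr_data` and `load_grid_data` come from the code at the end of this post):

# eager version of the same selection, without dask.delayed
def load_grid_data_eager(data, lat, lon, timestep):
    grid_data = data.sel(latitude=lat, longitude=lon, step=timestep)
    return (grid_data, lat, lon, timestep)

eager_result = load_grid_data_eager(xr_data, lat=0, lon=0, timestep=0)

# delayed version, materialized explicitly via compute()
delayed_result = load_grid_data(xr_data, lat=0, lon=0, timestep=0).compute()

# the two selected subsets should be identical
print(eager_result[0].identical(delayed_result[0]))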
Goal:
In the first step, I want to ensure that the function using `dask.delayed` produces the same results as the non-delayed version for a single grid cell and a single timestep. Once that is confirmed, I plan to scale this approach to process the entire grid and multiple timesteps.
Request:
Has anyone experienced similar issues with `dask.delayed`, or can anyone suggest why Dask might be trying to compute the entire dataset instead of just the specified subset? Any insights or debugging tips would be greatly appreciated.
The code used is:
import dask
import xarray as xr

# load the data lazily, chunked by forecast step and ensemble member
xr_data = xr.open_dataset('data.grb', engine='cfgrib', chunks={'step': 1, 'number': 1})

# create small parts of the data to perform some analysis
@dask.delayed
def load_grid_data(data, lat, lon, timestep):
    # select a single grid cell and a single timestep
    grid_data = data.sel(latitude=lat, longitude=lon, step=timestep)
    return (grid_data, lat, lon, timestep)
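The call site is not shown above; for completeness, this is roughly how the delayed function is invoked (a sketch, not the exact code):

# build the lazy task for a single grid cell at the first timestep
task = load_grid_data(xr_data, lat=0, lon=0, timestep=0)

# nothing runs until compute() is called; this is the point at which all
# threads start up and the kernel eventually crashes
grid_data, lat, lon, timestep = task.compute()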