Hello, I hope this is an appropriate venue for this question!
I am currently running a workflow that converts coupled GCM output “history fields” in netCDF format to time-series Zarr stores on cloud storage (Amazon S3, specifically). The code has run successfully on a fairly large dataset on a JupyterHub, and I was hoping to move the work into SLURM batch jobs. However, some of the batch jobs fail with seemingly random “OutOfBoundsDatetime” errors. Does anyone know how to diagnose this sort of error when using Dask/Zarr/xarray/fsspec?
Thank you!
ERROR MESSAGE:
2024-05-20 19:09:32,823 - distributed.worker - WARNING - Compute Failed
Key: ('open_dataset-concatenate-ce1061766bc0525459c914c7603c93fe', 4, 0, 0)
Function: getter
args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyIndexedArray(array=_ElementwiseFunctionArray(_ElementwiseFunctionArray(LazilyIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x14f9d1809e80>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x14f9e8c21940>, encoded_fill_values={1e+36}, decoded_fill_value=nan, dtype=dtype('float32')), dtype=dtype('float32')), func=functools.partial(<function decode_cf_timedelta at 0x14f9e8c24f70>, units='seconds'), dtype=dtype('<m8[ns]')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(0, 1, None), slice(0, 192, None), slice(0, 288, None)))
kwargs: {}
Exception: 'OutOfBoundsDatetime("cannot convert input 85286404096.0 with the unit \'s\'")'
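Reading the traceback, the failure seems to happen while xarray decodes a variable with units='seconds' via decode_cf_timedelta, and 85286404096 seconds is roughly 2,700 years, well outside the ~292-year span that nanosecond-precision timedelta64 can represent. In case it helps with diagnosis, this is roughly how the raw (undecoded) values can be inspected for a single history file; the file path and variable name below are placeholders rather than my actual data:

import xarray as xr

# Open one history file with time decoding switched off, so the raw encoded
# values are visible instead of being converted to datetime64/timedelta64.
# 'history_file.nc' and 'time' are placeholders for the real file and variable.
raw = xr.open_dataset('history_file.nc', decode_times=False, decode_timedelta=False)

# The units/calendar metadata xarray would use when decoding:
print(raw['time'].attrs)

# Raw numeric range; with units of seconds, anything much above ~9.2e9
# overflows timedelta64[ns] and raises OutOfBoundsDatetime.
print(float(raw['time'].min()), float(raw['time'].max()))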
Below is the code section that produces the error:
# Load data
data = xr.open_mfdataset(f'{path}/{prefix}', combine='nested', concat_dim=timedim)

## Set up Dask delayed objects
## This also creates the empty Zarr stores on S3 at this stage
l_pool_delayed = []
for var, zarr_store, mode, t0, varlist in self.to_do_dict[f'{domain}.{model}.{freq}']:
    if self.verbose: print(f'Set up DASK delayed: {var}')
    in_data = data[varlist + timevars].chunk(chunk).sel({timedim: slice(t0, '9999-12-31')})
    mapper = fsspec.get_mapper(zarr_store,
                               client_kwargs={'region_name': self.s3_region},
                               check=False)
    writeobj = in_data.to_zarr(store=mapper, mode=mode, compute=False)
    l_pool_delayed.append(writeobj)

## Start the Dask processes!
print('BEGINNING DASK DISTRIBUTED COMPUTE')
with Client(threads_per_worker=self.nthread, n_workers=self.nworker) as client:
    dask.compute(*l_pool_delayed)
A couple of additional comments:
- I am setting up the SLURM job by calling the above Python script from a bash script.
- When I run the same script on JupyterHub, it completes without issue (a quick version comparison between the two environments is sketched after these comments).
- The failure occurs at different points in the computation when I re-run the jobs; i.e., the error appears while processing different variables.
- “85286404096.0” does not appear as a value anywhere in the dataset, as far as I can tell, so I’m not sure where this value is coming from (the kind of scan I mean is sketched after these comments).
- Each of the workers throws the same error, but the float value in the exception message differs from worker to worker; for example, another worker reports:
Exception: 'OutOfBoundsDatetime("cannot convert input 85333753856.0 with the unit \'s\'")'
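To illustrate the check behind that “does not appear anywhere” comment, the kind of scan I have in mind opens the files with decoding disabled and reports the raw range of every variable whose units attribute looks time-like (a sketch only; the file pattern and concat dimension are placeholders):

import xarray as xr

# Open the same files with decoding disabled so the raw encoded values are visible.
# 'history_*.nc' and 'time' stand in for the actual path pattern and concat dimension.
raw = xr.open_mfdataset('history_*.nc', combine='nested', concat_dim='time',
                        decode_times=False, decode_timedelta=False)

# Report the raw min/max of every variable whose units look like a CF time or
# timedelta unit, to see where a value such as 85286404096.0 could be coming from.
for name, var in raw.variables.items():
    units = str(var.attrs.get('units', ''))
    if units.startswith(('seconds', 'minutes', 'hours', 'days')):
        print(name, units, float(var.min()), float(var.max()))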
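And since the same script runs cleanly on JupyterHub but fails under SLURM, comparing the software versions in the two environments seems like a sensible first check; a minimal sketch of what I mean:

import dask, distributed, fsspec, numpy, pandas, xarray, zarr

# Print the library versions in whichever environment this runs (the JupyterHub
# session or the SLURM batch job), to rule out a mismatch between the two setups.
for mod in (dask, distributed, fsspec, numpy, pandas, xarray, zarr):
    print(f'{mod.__name__}: {mod.__version__}')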