I’m locally processing data coming from a GCS bucket. The idea is to compute an average over many years of data.
The code is really simple, something along the lines of:
import xarray as xr
from dask.distributed import Client
client = Client()
reanalysis = xr.open_dataset(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    backend_kwargs={
        "storage_options": {"token": 'anon'},
    },
    engine="zarr",
    chunks={'time': 24},
    consolidated=True,
)
radiation = reanalysis["surface_solar_radiation_downwards"].sel(
    time=slice("1991-01-01", "2020-12-31")
)
res = radiation.resample(time='1YE').sum().mean(dim='time').compute()
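(For what it's worth, nothing has been downloaded at that point; as a quick sanity check, using only the variables defined above, one can look at how big the selection is and how it is chunked before calling compute:)

# The array is still lazy here, so printing it costs nothing:
# the repr shows the shape, dtype and the dask chunk layout.
print(radiation)
# Per-dimension chunk sizes (24 hourly steps per chunk along time, as requested
# via chunks={'time': 24} when opening the dataset).
print(radiation.chunks)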
Leaving the settings untouched, dask chooses 8 workers with 4 threads each (I have an i9-13900F), which is reasonable.
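For reference, the same layout can also be pinned down explicitly instead of relying on the defaults (the numbers below simply mirror what Dask picked on my machine, they are not a recommendation):

from dask.distributed import Client

# Request the same worker/thread layout Dask chose automatically here;
# n_workers and threads_per_worker are the obvious knobs to experiment with.
client = Client(n_workers=8, threads_per_worker=4)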
When computing, CPU usage is not high, and I was trying to understand why that is the case. Here is a quick animation of the client web dashboard, which looks healthy but does not show many updates.
Looking at the network bandwidth, I saw that I’m constantly hitting ~115 MB/s, which is surprisingly close to my internet connection's maximum speed (1000 Mb/s).
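Here is the back-of-the-envelope estimate I did (my own rough arithmetic, ignoring compression on the wire and any chunk over-read, so take it as an upper bound on the uncompressed volume):

# How many bytes is the selection if it all has to come down uncompressed,
# and how long would that take at the ~115 MB/s I am observing?
bytes_total = radiation.nbytes          # computed from shape/dtype, no download
hours = bytes_total / 115e6 / 3600      # observed throughput, in bytes per second
print(f"~{bytes_total / 1e9:.0f} GB in the selection, ~{hours:.1f} h at 115 MB/s")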
Does this mean that I cannot make this computation any faster, since I’m hitting my bandwidth limit and the workers simply cannot download the data any faster?