Am I hitting a network bottleneck, or is there room for improvement?

I’m processing data locally that comes from a GCS bucket. The idea is to compute an average over many years of data.
The code is really simple, something along the lines of:

import xarray as xr
from dask.distributed import Client

client = Client()

# Open the ARCO ERA5 Zarr store anonymously, chunked by 24 hourly steps along time
reanalysis = xr.open_dataset(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    backend_kwargs={
        "storage_options": {"token": "anon"},
    },
    engine="zarr",
    chunks={"time": 24},
    consolidated=True,
)

# Select 30 years of hourly surface solar radiation
radiation = reanalysis["surface_solar_radiation_downwards"].sel(
    time=slice("1991-01-01", "2020-12-31")
)

# Yearly sums, then the mean over the 30 yearly values
res = radiation.resample(time="1YE").sum().mean(dim="time").compute()

Leaving the settings untouched, Dask chooses 8 workers with 4 threads each (I have an i9-13900F), which is reasonable.
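
For reference, the same layout can be requested explicitly. This is just a minimal sketch; the numbers below mirror what Dask picked by default on my machine and are not a recommendation:

from dask.distributed import Client

# Explicitly request the same layout Dask chose by default on this machine:
# 8 worker processes with 4 threads each (illustrative values only).
client = Client(n_workers=8, threads_per_worker=4)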

While computing, CPU usage is not high, and I was trying to figure out why that is the case. Here is a quick animation of the client web dashboard, which seems healthy but does not show many updates.

[Screen recording of the Dask dashboard during the computation]

Looking at the network bandwidth, I saw that I’m constantly hitting ~115 MB/s, which is very close to my internet connection’s maximum speed of 1000 Mb/s (i.e. ~125 MB/s).
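
As a rough cross-check (not part of the original post), one can sample the OS network counters while the computation runs. This sketch assumes psutil is installed and simply estimates the average download rate over a 10-second window:

import time
import psutil

# Sample total bytes received over a short window to estimate download throughput
before = psutil.net_io_counters().bytes_recv
time.sleep(10)
after = psutil.net_io_counters().bytes_recv

mb_per_s = (after - before) / 10 / 1e6
print(f"Average download rate: {mb_per_s:.1f} MB/s")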

Does this mean that I cannot possibly make this computation faster, since I’m hitting my bandwidth limit and the workers cannot download data any faster?


Hi @guidocioni, welcome to the Dask community!

Well, I think you have already narrowed it down. If you read data from Google Cloud onto your personal computer, the network bandwidth will most likely be the bottleneck, and you have verified that it is.

If you are already reading only the data you need, which seems to be the case, the only solution is to run this computation in the cloud, preferably in the same region where the data is stored. You can do that with dask-kubernetes or dask-cloudprovider, or try Coiled or an equivalent service, as in the sketch below.
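
For illustration only, here is a minimal sketch of spinning up a cluster close to the data with Coiled; the region, worker count, and account setup are assumptions rather than a tested configuration, so check where the bucket actually lives and adjust:

import coiled

# Start a cluster in (what is assumed to be) the same GCP region as the bucket;
# the region and n_workers values here are placeholders.
cluster = coiled.Cluster(n_workers=20, region="us-central1")
client = cluster.get_client()

# The rest of the workflow is unchanged: open the Zarr store and compute there.

Running the workers where the data lives removes the home-connection bottleneck, and you can then scale the number of workers with the amount of data to read.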
