Am I hitting a network bottleneck, or is there room for improvement?

I’m processing data locally that comes from a GCS bucket. The idea is to compute an average over many years of data.
The code is really simple, something along the lines of:

import xarray as xr
from dask.distributed import Client

client = Client()

# Open the ARCO ERA5 Zarr store anonymously, chunked by 24 hourly steps along time
reanalysis = xr.open_dataset(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    backend_kwargs={
        "storage_options": {"token": "anon"},
    },
    engine="zarr",
    chunks={"time": 24},
    consolidated=True,
)

# Select 30 years of hourly surface solar radiation
radiation = reanalysis["surface_solar_radiation_downwards"].sel(
    time=slice("1991-01-01", "2020-12-31")
)

# Yearly sums, then the mean over the 30 yearly values
res = radiation.resample(time="1YE").sum().mean(dim="time").compute()

Leaving the settings untouched, Dask chooses 8 workers with 4 threads each (I have an i9-13900F), which is reasonable.
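
For reference, the same layout can be requested explicitly. This is just a minimal sketch; the numbers below mirror what Dask picked by default on my machine and are not a recommendation:

from dask.distributed import Client

# Explicitly request the same layout Dask chose by default on this machine:
# 8 worker processes with 4 threads each (illustrative values only).
client = Client(n_workers=8, threads_per_worker=4)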

While computing, CPU usage is not high, and I was trying to figure out why that is the case. Here is a quick animation of the client web dashboard, which seems healthy but does not show many updates.

[Screen recording of the Dask dashboard during the computation]

Looking at the network bandwidth, I saw that I’m constantly hitting ~115 MB/s, which is very close to my internet connection’s maximum speed of 1000 Mb/s (i.e. ~125 MB/s).
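
As a rough cross-check (not part of the original post), one can sample the OS network counters while the computation runs. This sketch assumes psutil is installed and simply estimates the average download rate over a 10-second window:

import time
import psutil

# Sample total bytes received over a short window to estimate download throughput
before = psutil.net_io_counters().bytes_recv
time.sleep(10)
after = psutil.net_io_counters().bytes_recv

mb_per_s = (after - before) / 10 / 1e6
print(f"Average download rate: {mb_per_s:.1f} MB/s")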

Does this mean that I cannot possibly make this computation faster, since I’m hitting my bandwidth limit and the workers cannot download data any faster?


Hi @guidocioni, welcome to the Dask community!

Well, I think you have already narrowed it down. If you read data from Google Cloud onto your personal computer, the network bandwidth will most likely be the bottleneck, and you have verified that it is.

If you are already reading only the data you need, which seems to be the case, the only solution is to run this computation in the cloud, preferably in the same region where the data is stored. You can do that with dask-kubernetes or dask-cloudprovider, or try Coiled or an equivalent service, as in the sketch below.
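
For illustration only, here is a minimal sketch of spinning up a cluster close to the data with Coiled; the region, worker count, and account setup are assumptions rather than a tested configuration, so check where the bucket actually lives and adjust:

import coiled

# Start a cluster in (what is assumed to be) the same GCP region as the bucket;
# the region and n_workers values here are placeholders.
cluster = coiled.Cluster(n_workers=20, region="us-central1")
client = cluster.get_client()

# The rest of the workflow is unchanged: open the Zarr store and compute there.

Running the workers where the data lives removes the home-connection bottleneck, and you can then scale the number of workers with the amount of data to read.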
