Failing on a simple example for to_hdf5

Hi everyone,

I have been using dask.bag for years, but thought I would give dask.array a try. I came up with what I believe is a very simple example, and yet I get an unexpected “TypeError: h5py objects cannot be pickled” error.

Basically, I’m trying to generate some random numbers, do some computation on them in a distributed way, and then store the results in an HDF5 file. Here is a minimal example that reproduces the issue.


import dask.array as da
from dask.distributed import Client, LocalCluster


def main():
    repeats = 3
    N = 1000

    with LocalCluster() as cluster:
        with Client(cluster) as client:
            print(client)
            print(client.scheduler_info())
            print(client.dashboard_link)

            rng = da.random.default_rng()
            omT = (
                da.array([7000.0, 1900.0])
                * rng.random((repeats * N, 2), chunks=(100, 2))
            ).round(1)
            theta = rng.random((N, 2, 10), chunks=(100, 2, 10))
            theta[:, 1, :] = (1 - theta[:, 0, :]) * theta[:, 1, :]
            theta = theta.round(4).reshape(N, -1)
            # Do more stuff

            # This call raises "TypeError: h5py objects cannot be pickled"
            da.to_hdf5("inputs.hdf5", {"/omT": omT, "/theta": theta})


if __name__ == "__main__":
    main()

I’d be happy if someone could help me understand what I’m doing wrong.
Thanks!

Hi @cyrilpic, welcome to the Dask community!

Actually, h5py does not support reading from or writing to a file from multiple processes, so Dask’s to_hdf5 method does not work with a Distributed cluster. There are several open issues about this on the Dask issue tracker.
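
As a side note, if the data fits on a single machine, the write itself does work without a distributed cluster. Here is a minimal sketch, assuming you drop the LocalCluster/Client entirely and let Dask’s default threaded scheduler run the graph: threads share a single process, so the h5py file handle never has to be pickled.

import dask.array as da

# Sketch without a Client: the default threaded scheduler runs in one
# process, so no h5py object is ever pickled, and to_hdf5 serializes
# access to the file with a lock.
rng = da.random.default_rng()
omT = (da.array([7000.0, 1900.0]) * rng.random((3000, 2), chunks=(100, 2))).round(1)
da.to_hdf5("inputs.hdf5", {"/omT": omT})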

Unfortunately, making to_hdf5 work on a distributed cluster has not been solved, as it is not easy to do. Two things you might consider:

  • Use Xarray on top of Dask, which seems to handle this,
  • Use another format that supports concurrent writes, like Zarr (see the sketch after this list).
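
For the Zarr option, the last line of your example could become something like the sketch below (the file name inputs.zarr and the component names are just assumptions, and this needs the zarr package installed). Zarr stores each chunk as a separate object, so multiple worker processes can write concurrently, and this works fine with a distributed Client.

# Replace the da.to_hdf5 call with two to_zarr calls writing into the
# same Zarr group; each chunk is written independently by the workers.
da.to_zarr(omT, "inputs.zarr", component="omT")
da.to_zarr(theta, "inputs.zarr", component="theta")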