Failing on a simple example for to_hdf5

Hi everyone,

I have been using dask.bag for years, but thought I would give dask.array a try. I came up with what I believe is a very simple example, and yet I get an unexpected “TypeError: h5py objects cannot be pickled” error.

Basically, I’m trying to generate some random numbers, do some computation on them in a distributed way, and then store the results in an HDF5 file. Here is a minimal example that reproduces the issue.


import dask.array as da
from dask.distributed import Client, LocalCluster


def main():
    repeats = 3
    N = 1000

    with LocalCluster() as cluster:
        with Client(cluster) as client:
            print(client)
            print(client.scheduler_info())
            print(client.dashboard_link)

            rng = da.random.default_rng()
            omT = (
                da.array([7000.0, 1900.0])
                * rng.random((repeats * N, 2), chunks=(100, 2))
            ).round(1)
            theta = rng.random((N, 2, 10), chunks=(100, 2, 10))
            theta[:, 1, :] = (1 - theta[:, 0, :]) * theta[:, 1, :]
            theta = theta.round(4).reshape(N, -1)
            # Do more stuff

            # This call raises "TypeError: h5py objects cannot be pickled"
            da.to_hdf5("inputs.hdf5", {"/omT": omT, "/theta": theta})


if __name__ == "__main__":
    main()

I’d be happy if someone could help me understand what I’m doing wrong.
Thanks!

Hi @cyrilpic, welcome to the Dask community!

Actually, h5py does not support reading from or writing to a file from multiple processes, so Dask’s to_hdf5 method does not work with a Distributed cluster. There are several open issues about this on the Dask issue tracker.
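
As a side note, if the data fits on a single machine, the write itself does work without a distributed cluster. Here is a minimal sketch, assuming you drop the LocalCluster/Client entirely and let Dask’s default threaded scheduler run the graph: threads share a single process, so the h5py file handle never has to be pickled.

import dask.array as da

# Sketch without a Client: the default threaded scheduler runs in one
# process, so no h5py object is ever pickled, and to_hdf5 serializes
# access to the file with a lock.
rng = da.random.default_rng()
omT = (da.array([7000.0, 1900.0]) * rng.random((3000, 2), chunks=(100, 2))).round(1)
da.to_hdf5("inputs.hdf5", {"/omT": omT})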

Unfortunately, making to_hdf5 work on a distributed cluster has not been solved, as it is not easy to do. Two things you might consider:

  • Use Xarray on top of Dask, which seems to handle this,
  • Use another format that supports concurrent writes, like Zarr (see the sketch after this list).
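
For the Zarr option, the last line of your example could become something like the sketch below (the file name inputs.zarr and the component names are just assumptions, and this needs the zarr package installed). Zarr stores each chunk as a separate object, so multiple worker processes can write concurrently, and this works fine with a distributed Client.

# Replace the da.to_hdf5 call with two to_zarr calls writing into the
# same Zarr group; each chunk is written independently by the workers.
da.to_zarr(omT, "inputs.zarr", component="omT")
da.to_zarr(theta, "inputs.zarr", component="theta")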