Hello!
Similar to my previous post, I’m processing about a thousand heavy HDF5 files for some data analysis, and I’m doing it with Dask.
My issue is with the way Dask.distributed handles HDF5 files. I have looked through various GitHub issues trying to understand how this works.
The thing is, when I write a Dask-oriented function to read the slices concurrently, say like this:
import h5py
import dask.array as da

def stack_vars(file):
    # snap_dict, st_k, y_i and st_i are defined elsewhere in my script
    with h5py.File(file, 'r') as t_hdf5:
        t_base = t_hdf5['Base']
        dask_arrays = list()
        for _, dset_name in snap_dict.items():
            # wrap each HDF5 dataset as a lazy dask array
            dset = t_base[dset_name]
            array = da.from_array(dset)
            # slice and transpose before stacking
            array = array[::st_k, y_i, ::st_i]
            dask_arrays.append(array.T)
        stack = da.stack(dask_arrays, axis=0)
    return stack
and then map this function over my series of HDF5 files:
x_t_fut = client.map(stack_vars, plot_files)
and finally gather the result (not that big, ~20 GB) into memory:
x_t = client.gather(x_t_fut)
I end up with an error like this:
TypeError: Could not serialize object of type Array.
Traceback (most recent call last):
File "/linkhome/rech/genenm01/rgrs001/.conda/envs/jupyter_dask/lib/python3.8/site-packages/distributed/protocol/pickle.py", line 49, in dumps
result = pickle.dumps(x, **dump_kwargs)
File "/linkhome/rech/genenm01/rgrs001/.local/lib/python3.8/site-packages/h5py/_hl/base.py", line 372, in __getnewargs__
raise TypeError("h5py objects cannot be pickled")
TypeError: h5py objects cannot be pickled
So I would like to ask whether anyone has performed this kind of operation in a Dask distributed setting, and what the actual best practices are for handling several HDF5 files in parallel through h5py (I have heard about xarray, but it is not really appropriate in my case).
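For concreteness, here is a minimal sketch of the kind of workaround I have in mind, where each task reads its slices eagerly into plain NumPy arrays inside the worker, so that no h5py-backed object has to be serialized back to the client. The name stack_vars_np and the approach itself are just my assumption of what a fix could look like, not something I have validated:

import h5py
import numpy as np

def stack_vars_np(file):
    # Hypothetical variant: materialize the slices as NumPy inside the task,
    # so the future holds a plain ndarray instead of an h5py-backed dask array.
    # snap_dict, st_k, y_i and st_i as defined elsewhere in my script.
    with h5py.File(file, 'r') as t_hdf5:
        t_base = t_hdf5['Base']
        arrays = []
        for _, dset_name in snap_dict.items():
            # read only the needed slice from disk (eager, not lazy)
            arrays.append(t_base[dset_name][::st_k, y_i, ::st_i].T)
        return np.stack(arrays, axis=0)

# x_t_fut = client.map(stack_vars_np, plot_files)
# x_t = client.gather(x_t_fut)   # list of NumPy arrays, ~20 GB in total

I am not sure this is the idiomatic way to do it, or whether I lose too much of Dask's laziness by computing inside each task, hence my question.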
Cheers