Hello everyone!
I have a problem that I don't know how to solve, so I'm coming here looking for help.
Basically, I have data in a zarr file and want to read it with Dask and process it. For that, I have written the following code:
    import dask.array as da
    from os.path import join

    def preprocess_count_matrix(x, normalization):
        if normalization == DISS_NORMALIZATION:
            return x.map_blocks(
                process_data_main,
                max_len=2000,
                dtype='f4',
            )
        elif normalization == 'raw':
            return x
        else:
            raise ValueError(f'Normalization {normalization!r} not found')

    XX = preprocess_count_matrix(da.from_zarr(join(DATA_PATH, split, f'zarr_{i}'), 'X'), NORMALIZATION)
The function process_data_main basically converts the data, whatever its input shape, into an array whose second dimension is 2000. However, if my initial zarr file has shape, say, (n x 500), Dask still reports XX as (n x 500) instead of (n x 2000), even though I can call .compute() and verify that the second dimension really is 2000.
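For context, here is a minimal stand-in that reproduces what I am seeing (the real process_data_main is more involved; pad_to_len and the random data below are made up just for illustration):

    import numpy as np
    import dask.array as da

    def pad_to_len(block, max_len=2000):
        # Hypothetical stand-in for process_data_main: pad each block's
        # second dimension with zeros up to max_len.
        padded = np.zeros((block.shape[0], max_len), dtype='f4')
        padded[:, :block.shape[1]] = block
        return padded

    x = da.random.random((100, 500), chunks=(50, 500))
    XX = x.map_blocks(pad_to_len, max_len=2000, dtype='f4')

    print(XX.shape)            # (100, 500)  <- the lazy metadata Dask reports
    print(XX.compute().shape)  # (100, 2000) <- the actual computed shape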
This is a problem because later I want to concatenate XX with other arrays that must also have shape (n x 2000), but I can't, since Dask still thinks XX has its previous shape. How can I fix this? The meta argument, as far as I know, cannot be used to set the shape of the output. I could always call .compute() or .persist() (or similar) to get out of lazy mode, but I want to avoid that as much as possible.
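Continuing the toy example above, the concatenation then fails on the stale metadata, even though the computed blocks would line up (other is just a placeholder for the arrays I actually want to attach):

    # Another lazy array that really is (n, 2000).
    other = da.random.random((100, 2000), chunks=(50, 2000))

    # Dask validates the lazy shapes here, so this raises a ValueError
    # about misaligned shapes ((100, 500) vs (100, 2000)), even though
    # XX's blocks are genuinely (50, 2000) once computed.
    combined = da.concatenate([XX, other], axis=0)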
I'm quite lost; I've been trying for a while with no success.
Thanks!!