Change array shape with map_blocks function

Hello everyone!

I have a problem that I don’t know how to solve, so I come here looking for help :slight_smile:

Basically, I have data in a zarr file and want to read it with Dask and process it. For that, I have written the following code:

from os.path import join

import dask.array as da

def preprocess_count_matrix(x, normalization):
    if normalization == DISS_NORMALIZATION:
        # process_data_main reshapes each block so its second dimension is max_len
        return x.map_blocks(
            process_data_main,
            max_len=2000,
            dtype='f4',
        )
    elif normalization == 'raw':
        return x
    else:
        raise ValueError(f'Normalization {normalization!r} not found')

XX = preprocess_count_matrix(da.from_zarr(join(DATA_PATH, split, f'zarr_{i}'), 'X'), NORMALIZATION)

The function process_data_main converts the data, regardless of its input shape, into an array whose second dimension is 2000. However, if my initial zarr array has shape, say, (n x 500), Dask still reports XX as (n x 500) instead of (n x 2000), even though calling .compute() shows that the second dimension is indeed 2000.
This is a problem because later I want to concatenate XX with other arrays that are also (n x 2000), but I can't, since Dask still thinks it has the old shape. How can I fix this? The meta argument, as far as I know, cannot be used to set the shape of the output. I could always call .compute() or .persist() to get out of lazy mode, but I want to avoid that as much as possible.
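To illustrate, here is a toy version of what happens, using a stand-in function instead of my actual process_data_main:

import numpy as np
import dask.array as da

def widen(block):
    # stand-in for process_data_main: returns a block with 2000 columns
    return np.zeros((block.shape[0], 2000), dtype='f4')

x = da.zeros((100, 500), chunks=(50, 500))
xx = x.map_blocks(widen, dtype='f4')

print(xx.shape)            # (100, 500)  <- Dask still reports the old shape
print(xx.compute().shape)  # (100, 2000) <- the actual computed shape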

I’m quite lost, been trying for a while with no success.

Thanks!!

Hi @AlejandroTL, welcome to Dask Discourse!

I think you are looking for the chunks kwarg of map_blocks, as described in the documentation:

Chunk shape of resulting blocks if the function does not preserve shape. If not provided, the resulting array is assumed to have the same block structure as the first input array.

Dask has no way to infer the output chunk shape itself, so you must specify it explicitly.
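
For example, something along these lines (a minimal sketch, with a placeholder function in place of your process_data_main):

import numpy as np
import dask.array as da

def widen(block, max_len=2000):
    # placeholder for process_data_main: pad each block to max_len columns
    out = np.zeros((block.shape[0], max_len), dtype='f4')
    out[:, :block.shape[1]] = block
    return out

x = da.zeros((100, 500), chunks=(50, 500))

xx = x.map_blocks(
    widen,
    max_len=2000,
    dtype='f4',
    chunks=(50, 2000),  # tell Dask the shape of each resulting block
)

print(xx.shape)  # (100, 2000) -- correct while still lazy

With the chunks kwarg set, the resulting array has the right shape in the task graph, so you can concatenate it with other (n x 2000) arrays without calling .compute() or .persist().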