`map_blocks` with different `chunks` and `new_axis` collapses a dimension when summed

Perhaps `map_blocks` with different `chunks` and `new_axis` collapses a dimension when summed. · Issue #11201 · dask/dask · GitHub is not a bug, but do you know, if not, what the correct arguments should be?

Hi @ilan-gold, welcome to Dask Discourse forum!

Looking at your example from the issue:

import numpy as np
import dask.array as da

M = 10_000
n_blocks = 10
N = 4_000
def make_chunk():
    arr = np.random.random((M,N))
    return arr

arr = da.map_blocks(make_chunk, meta=np.array((1.,), dtype=np.float64), dtype=np.float64, chunks=((M,) * n_blocks, (N,) * 1))
def __gram_block(block):
    return block.T @ block

gram_chunk_matrices = arr.map_blocks(__gram_block, chunks=((1, ) * n_blocks, (arr.shape[1],), (arr.shape[1],)), new_axis=(1, ), dtype=arr.dtype, meta=np.array([]))
assert len(gram_chunk_matrices.shape) == 3
gram_matrix = gram_chunk_matrices.sum(axis=0).compute()
assert len(gram_matrix.shape) == 2

and the question:

The shape at the end is wrong where it seems a dimension has been collapsed.

I’m not sure I am following. It seems normal to me that applying sum over an axis will collapse it, it is the same with Numpy. Am I misunderstanding something?

@ guillaumeeb I think you might be misinterpreting -

assert gram_chunk_matrices.sum(axis=0).shape == gram_chunk_matrices.sum(axis=0).compute().shape

This fails, so I think something is up.

What are the results of both shape calls?

(400, 400) for the dask array and then (400,) for the computed numpy one, so the dimension is missing, it seems, upon compute

So I tried you example, and I think I found the issue.

You first build a dask array arr of shape (100000, 4000), with (10000,4000) chunks. You then apply __gram_block function on each chunk, and tell Dask the output array will have 10 chunks of (1, 4000, 4000) shape. But it’s not true. The result is 10 chunks of (4000,4000) shape.

Dask does not fail upon computing, but gives the result with shapes corrected. But before computing, it was expecting to have one more dimension.

So I guess that circles back to the original question of whether this is a bug or a misunderstanding. What is the new_axis argument supposed to do then if not making a new axis for chunks?

From map_blocks documentation on new_axis:

New dimensions created by the function

So this kwarg has to be used if you function creates a new axis, it will not do it for you.

If you want another behavior, you can have a lool at stack functon for example.

Ok @guillaumeeb I misunderstood that as “the function” being map_blocks i.e., this is how you tell map blocks to do something along the lines of [:, None] and not that you do it yourself. Thanks!