da.map_blocks introduces unexpected chunks?

Hello all,

I am having trouble using da.map_blocks to process chunks of a NumPy array in parallel.

Here is the setup, as snippets of the relevant code passages:

# Loading image, getting all relevant preprocessing steps done, and converting to a numpy array...
.
.
.
print(f"Image:")
print(f" type:   {type(image)}")
print(f" length  {len(image)}")
image_size = image.shape
print(f" shape:  {image_size}")
image_chunks = (image_size[0] / chunk_along_axis(image_size[0]),
                image_size[1] / chunk_along_axis(image_size[1]),
                image_size[2] / chunk_along_axis(image_size[2]),
                image_size[3] / chunk_along_axis(image_size[3]))
print(f" chunks: {image_chunks}")

image_dask = da.from_array(image, chunks=image_chunks)
print(f"Number of chunks: {image_dask.chunks}")
for s, chunk_series in enumerate(image_dask.chunks[0]):
    for i, chunk_depth in enumerate(image_dask.chunks[1]):
        for j, chunk_height in enumerate(image_dask.chunks[2]):
            for k, chunk_width in enumerate(image_dask.chunks[3]):
                print(f"S{s}:{chunk_series} | D{i}:{chunk_depth} | H{j}:{chunk_height} | W{k}: {chunk_width}")

processed_image = da.map_blocks(compute_with_DASK, image_dask, arg1, arg2)

chunk_along_axis returns an integer that the axis length is divided by, so that the axis is split into even parts.
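
For reference, a hypothetical version of chunk_along_axis that is consistent with the chunk sizes shown below could look like this (the target block length of roughly 300 is only an illustration, not my actual implementation):

import math

def chunk_along_axis(axis_length, target=300):
    # Smallest number of even divisions such that each block
    # is at most `target` elements long.
    n = max(1, math.ceil(axis_length / target))
    while axis_length % n != 0:
        n += 1
    return n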
compute_with_DASK is defined as:

def compute_with_DASK(image_chunk, arg1, arg2):
    chunk_size = image_chunk.shape
    print(f"Processing with dask of chunk with size: {chunk_size}")
    processed_chunk = computing_function(image_chunk, arg1, arg2)
    return processed_chunk

When I run this code, the image-related print() statements give me:

Image:
 type:    <class 'numpy.ndarray'>
 length:  1
 shape:  (1, 1092, 520, 520)

This should be divided into chunks of the following sizes:

chunks: (1.0, 273.0, 260.0, 260.0)

To my understanding, this should result in 16 chunks (1 × 4 × 2 × 2: the depth axis of 1092 splits into four blocks of 273, and each 520-pixel axis splits into two blocks of 260). This is also confirmed by printing them individually with the four nested for-loops:

S0:1 | D0:273 | H0:260 | W0: 260  # 1
S0:1 | D0:273 | H0:260 | W1: 260  # 2
S0:1 | D0:273 | H1:260 | W0: 260  # 3
S0:1 | D0:273 | H1:260 | W1: 260  # 4
S0:1 | D1:273 | H0:260 | W0: 260  # 5
S0:1 | D1:273 | H0:260 | W1: 260  # 6
S0:1 | D1:273 | H1:260 | W0: 260  # 7
S0:1 | D1:273 | H1:260 | W1: 260  # 8
S0:1 | D2:273 | H0:260 | W0: 260  # 9
S0:1 | D2:273 | H0:260 | W1: 260  # 10
S0:1 | D2:273 | H1:260 | W0: 260  # 11
S0:1 | D2:273 | H1:260 | W1: 260  # 12
S0:1 | D3:273 | H0:260 | W0: 260  # 13
S0:1 | D3:273 | H0:260 | W1: 260  # 14
S0:1 | D3:273 | H1:260 | W0: 260  # 15
S0:1 | D3:273 | H1:260 | W1: 260  # 16

Now, when I reach the da.map_blocks part, the print() statements of the compute_with_DASK function show:

Processing with dask of chunk with size: (1, 1, 1, 1) 

or

Processing with dask of chunk with size: (0, 0, 0, 0) 

When I want to use the data of processed_image, I get the following error:

# Traceback
.
.
.
    result.append(tuple(shape(deepfirst(a))[dim] for a in arrays))
IndexError: tuple index out of range

Where do these extra chunks of unexpected size come from? And is the tuple index going out of range because of these artificially added chunks?

compute_with_DASK does not alter the size of the image, so I assume the problem does not come from there.

I would be very thankful if you have any idea what I am doing wrong. I really appreciate any help you can provide.

Hi @Keyn34,

I suspect what you are seeing comes from the fact that you are not passing any meta kwarg to map_blocks. See the documentation:

Note that map_blocks will attempt to automatically determine the output array type by calling func on 0-d versions of the inputs. Please refer to the meta keyword argument below if you expect that the function will not succeed when operating on 0-d arrays.

The meta of the output array, when specified is expected to be an array of the same type and dtype of that returned when calling .compute() on the array returned by this function. When not provided, meta will be inferred by applying the function to a small set of fake data, usually a 0-d array. It’s important to ensure that func can successfully complete computation without raising exceptions when 0-d is passed to it, providing meta will be required otherwise. If the output type is known beforehand (e.g., np.ndarray, cupy.ndarray), an empty array of such type dtype can be passed, for example: meta=np.array((), dtype=np.int32).
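
In your case, a minimal sketch would be to pass an empty NumPy array of the expected dtype (assuming compute_with_DASK returns a plain NumPy array with the same dtype as its input):

import dask.array as da
import numpy as np

processed_image = da.map_blocks(
    compute_with_DASK,
    image_dask,
    arg1,
    arg2,
    # Declare the output type/dtype up front so Dask does not call
    # compute_with_DASK on 0-d fake arrays to infer them.
    meta=np.array((), dtype=image_dask.dtype),
)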


Thanks a lot, @guillaumeeb. I'm not sure why I kept missing that part of the documentation; I thought the issue was in how I defined my dask.array.

Unfortunately, I still get the error chain where the tuple index goes out of range. Are there any common solutions for that, or places I could look further?

Well, it's a bit hard to tell without a reproducer. Would you be able to build one?
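
Something self-contained along these lines would already help: a small random array with the same 4-D layout, and a stand-in for computing_function that you grow until the IndexError appears again (the shapes and the dummy function here are only placeholders):

import dask.array as da
import numpy as np

def compute_with_DASK(image_chunk, arg1, arg2):
    # Replace this dummy with just enough of the real computing_function
    # to trigger the IndexError again.
    return image_chunk * arg1 + arg2

# Small random array standing in for the real image.
image = np.random.random((1, 12, 8, 8))
image_dask = da.from_array(image, chunks=(1, 3, 4, 4))

processed_image = da.map_blocks(
    compute_with_DASK, image_dask, 2.0, 1.0,
    meta=np.array((), dtype=image.dtype),
)
print(processed_image.compute().shape)  # expected (1, 12, 8, 8)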