Why do chunks get inverted?

TMillenaar · August 17, 2023, 7:41am

Hey,

I have a question about the reasoning behind the implementation of chunks.
When I call map_blocks on two arrays, naturally the chunks need to match.
I noticed by trial and error that the chunks need to align in reverse if the number of dimensions doesn’t match.
To illustrate this, see the following example and note that arr2 and arr3 have their chunks swapped:

import numpy as np
import dask.array as da

def add(x, y):
    return x + y

# 1-D array split into 4 blocks
arr1 = da.arange(10, chunks=((1, 3, 3, 3),))
arr2_np = np.arange(100).reshape((10, 10))
# Same data, chunked as 4 x 2 blocks (arr2) and as 2 x 4 blocks (arr3)
arr2 = da.from_array(arr2_np, chunks=((1, 3, 3, 3), (5, 5)))
arr3 = da.from_array(arr2_np, chunks=((5, 5), (1, 3, 3, 3)))

result = da.map_blocks(add, arr1, arr2).compute()  # raises ValueError
result_inverse = da.map_blocks(add, arr1, arr3).compute()  # works

The first map_blocks call (with arr2) fails with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/base.py", line 310, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/base.py", line 589, in compute
    dsk = collections_to_dsk(collections, optimize_graph, **kwargs)
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/base.py", line 362, in collections_to_dsk
    dsk = opt(dsk, keys, **kwargs)
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/array/optimization.py", line 50, in optimize
    dsk = fuse_roots(dsk, keys=keys)
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/blockwise.py", line 1535, in fuse_roots
    new = toolz.merge(layer, *[layers[dep] for dep in deps])
  File "/home/timo/Documents/open_source/tmillenaar/dask/venv/lib/python3.10/site-packages/toolz/dicttoolz.py", line 39, in merge
    rv.update(d)
  File "/usr/lib/python3.10/_collections_abc.py", line 886, in __iter__
    yield from self._mapping
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/blockwise.py", line 494, in __iter__
    return iter(self._dict)
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/blockwise.py", line 470, in _dict
    dims=self.dims,
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/blockwise.py", line 446, in dims
    self._dims = _make_dims(self.indices, self.numblocks, self.new_axes)
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/blockwise.py", line 1484, in _make_dims
    dims = broadcast_dimensions(indices, numblocks)
  File "/home/timo/Documents/open_source/tmillenaar/dask/dask/blockwise.py", line 1475, in broadcast_dimensions
    raise ValueError("Shapes do not align %s" % g)
ValueError: Shapes do not align {'.0': {2, 4}, '.1': {4}}

The second attempt (with arr3) works.

Intuitively, I expected the first (and only) axis of arr1 to align with the first axis of arr2, but instead it seems to align with the last axis of arr2.
This is because the order is deliberately flipped here:

https://github.com/dask/dask/blob/461355e117958354ed30e29893afecd5e9258ff0/dask/array/core.py#L802-L805

          argpairs = [
              (a, tuple(range(a.ndim))[::-1]) if isinstance(a, Array) else (a, None)
              for a in args
          ]
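
To make the effect of that reversal concrete, here is a minimal sketch in plain Python (no dask required). The helpers `reversed_labels` and `group_numblocks` are hypothetical names made up for this illustration; they only mimic, in simplified form, how the index labels from `argpairs` are grouped per label, and are not dask functions.

```python
def reversed_labels(ndim):
    # Same expression as in the argpairs snippet: ndim=2 gives (1, 0), ndim=1 gives (0,)
    return tuple(range(ndim))[::-1]

def group_numblocks(numblocks_per_array):
    # Collect, per index label, the number of blocks each array has on that axis.
    groups = {}
    for numblocks in numblocks_per_array:
        for label, n in zip(reversed_labels(len(numblocks)), numblocks):
            groups.setdefault(label, set()).add(n)
    return groups

arr1_blocks = (4,)     # chunks ((1, 3, 3, 3),)
arr2_blocks = (4, 2)   # chunks ((1, 3, 3, 3), (5, 5))
arr3_blocks = (2, 4)   # chunks ((5, 5), (1, 3, 3, 3))

with_arr2 = group_numblocks([arr1_blocks, arr2_blocks])
with_arr3 = group_numblocks([arr1_blocks, arr3_blocks])
print(with_arr2)  # label 0 collects both 4 and 2: a mismatch
print(with_arr3)  # every label has exactly one block count
```

Because the labels are reversed, label 0 always refers to the last axis of each array, so the single axis of arr1 is paired with the last axis of arr2. That is consistent with the `'.0': {2, 4}` entry in the error message above.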

Now I wonder, why is this being flipped?

Cheers,
Timo

P.S.
I was attempting to debug an issue I filed a while back.

github.com/dask/dask: map_blocks with arrays of different sizes
opened 01:38PM - 06 Jul 23 UTC by tmillenaar · labels: needs attention, needs triage

**Describe the issue**: A function that is supplied to `dask.array.map_blocks` can be called with multiple Dask arrays as input. This generally works if the shape of the input arrays is the same. If, however, the input arrays have a different shape, unexpected behavior can arise. The chunks and shapes are often not as expected. Sometimes calling `compute()` on the result will yield an incorrect array and sometimes it errors with `*** IndexError: tuple index out of range`.

**Minimal Complete Verifiable Example**: In this example I define a function where the second argument is unused and the first argument is returned unmodified. The expectation therefore is that the result is unaffected by the second array.

```python
import numpy as np
import dask.array as da

def foo(arr1, arr2):
    return arr1

arr_3d = da.stack(3*[da.eye(2)])
arr_2d = da.stack(3*[da.arange(2)])
print("Shape of arr_3d:", arr_3d.shape)
print("Shape of arr_2d:", arr_2d.shape)
result = da.map_blocks(foo, arr_3d, arr_2d)
print("Shape of result:", result.shape)
```

This returns:

```
Shape of arr_3d: (3, 2, 2)
Shape of arr_2d: (3, 2)
Shape of result: (3, 3, 2)
```

Of course I would have expected the shape of the result to be (3, 2, 2) and not (3, 3, 2). What is possibly even more telling than the shape are the chunks. If I run the code above but print the chunks instead of the shape I get:

```
Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of arr_2d: ((1, 1, 1), (2,))
Chunks of result: ((1, 1, 1), (1, 1, 1), (2,))
```

If instead of supplying `arr_2d` to the function, I supply any non-dask object, the function works as expected. In this case I supplied None instead:

```
Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of result: ((1, 1, 1), (2,), (2,))
```

So the mere presence of the second array will mess up the chunking and shape of the return array.

**Anything else we need to know?**: I think the problem might be originating in the creation of `argpairs` here: https://github.com/dask/dask/blob/461355e117958354ed30e29893afecd5e9258ff0/dask/array/core.py#L802-L805

The variable `argpairs` will have the length of the respective arrays, but it gets inverted. In this example, the indices that go with the `argpairs` are (2, 1, 0) for the first array, where the corresponding chunks are ((1, 1, 1), (2,), (2,)). For the second array we have the indices (1, 0) with the corresponding chunks ((1, 1, 1), (2,)). What I was assuming when calling the function was that the chunks are matched in order. Instead, we now have the (1, 1, 1) corresponding to index 1 of the second argument overwriting the (2,) corresponding to index 1 of the first argument. The overwrite is occurring here: https://github.com/dask/dask/blob/461355e117958354ed30e29893afecd5e9258ff0/dask/array/blockwise.py#L192-L195

I have not tried to fiddle with `argpairs` yet, for I am not very familiar with Dask's internals and I want to check here first if this is even supported functionality.

**Workaround**: It does work if we create an extra (dummy) dimension on the second array so it matches the number of axes of the first. The following works as expected: `arr_2d = da.expand_dims(arr_2d, 2)`, which gives the chunks:

```
Chunks of arr_3d: ((1, 1, 1), (2,), (2,))
Chunks of arr_2d: ((1, 1, 1), (2,), (1,))
Chunks of result: ((1, 1, 1), (2,), (2,))
```

**Expectation of solution**: I am unsure if this feature is meant to be supported. While I would love for this feature to be supported, another possible outcome of this issue could of course be the statement that we do not support `map_blocks` on arrays of different sizes. In that case an explicit error could be beneficial.

**Environment**:

- Dask version: 2023.3.2 and 2023.5.0
- Python version: 3.8.10
- Operating System: Ubuntu 20.04 LTS (focal)
- Install method (conda, pip, source): pip

It was not given any attention so I decided to give it a try.
I realized though that I might just not understand the reason for the design decisions made in the past.

guillaumeeb · August 19, 2023, 9:48pm

Okay, I’m not an array expert, but I believe this has to do with NumPy broadcasting rules. I just tried the following, building fake chunks with the same shapes as the chunks of arr2 and arr3 in your example:

arr2_chunk = np.arange(15).reshape((3, 5))
arr1.blocks[1].compute() + arr2_chunk

results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[29], line 1
----> 1 arr1.blocks[1].compute() + arr2_chunk

ValueError: operands could not be broadcast together with shapes (3,) (3,5)

Whereas:

arr3_chunk = np.arange(15).reshape((5,3))
arr1.blocks[1].compute() +  arr3_chunk

just works:

array([[ 1,  3,  5],
       [ 4,  6,  8],
       [ 7,  9, 11],
       [10, 12, 14],
       [13, 15, 17]])

So map_blocks is just checking whether the NumPy operations can be performed given the shapes of the arrays.
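
As a small sketch of the rule at play: NumPy compares shapes starting from the trailing (last) axis, which can be checked with `np.broadcast_shapes` without building any large arrays. The variable names below are only for illustration.

```python
import numpy as np

block = np.array([1, 2, 3])  # same values as arr1.blocks[1], shape (3,)

# Trailing axes are compared first: (3,) vs (5, 3) matches on the last axis.
shape_ok = np.broadcast_shapes(block.shape, (5, 3))
print(shape_ok)

# (3,) vs (3, 5) compares 3 with 5 on the last axis, so broadcasting fails.
try:
    np.broadcast_shapes(block.shape, (3, 5))
    first_axis_aligned = True
except ValueError:
    first_axis_aligned = False
print(first_axis_aligned)

# A trailing axis of length 1 makes the 3 line up with the first axis instead.
column = block[:, None]  # shape (3, 1)
shape_col = np.broadcast_shapes(column.shape, (3, 5))
print(shape_col)
```

So a 1-D array is always matched against the last axis of a 2-D array unless an explicit length-1 axis is added after it.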

TMillenaar · August 21, 2023, 2:59pm

Thanks @guillaumeeb, that must be it!

I just had a look at numpy’s broadcasting rules because of your suggestion. Intuitively I was trying to match up the first axis of arr1 with the first axis of arr2. I think I made the following mistake:

It nevertheless has some funny consequences when using map_blocks but at least I understand now where these originate.
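
For anyone who does want the first axes to line up, one possible workaround (the same idea as the `expand_dims` trick in the issue linked above) is to give the 1-D array a trailing axis of length 1 before calling map_blocks. This is a minimal sketch reusing the arrays from the first post; it assumes a reasonably recent dask and has not been checked against every version.

```python
import numpy as np
import dask.array as da

def add(x, y):
    return x + y

arr1 = da.arange(10, chunks=((1, 3, 3, 3),))
arr2_np = np.arange(100).reshape((10, 10))
arr2 = da.from_array(arr2_np, chunks=((1, 3, 3, 3), (5, 5)))

# Add a trailing length-1 axis so both arrays are 2-D and
# arr1's chunks (1, 3, 3, 3) sit on the first axis, like arr2's.
arr1_col = arr1[:, None]  # shape (10, 1)

result = da.map_blocks(add, arr1_col, arr2).compute()

# Reference result computed with plain NumPy broadcasting
expected = np.arange(10)[:, None] + arr2_np
print(result.shape)
print(np.array_equal(result, expected))
```

With the extra axis, each block of arr1 has a shape like (3, 1), which broadcasts against the (3, 5) blocks of arr2 along the first axis.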

Cheers,
Timo
