`da.zeros_like` instantiated arrays are not writeable when `map_blocks` is run over them

Summary

Dask arrays created with da.zeros_like are not writeable when used with map_blocks.
Trying to write in place to the block received inside the mapped function produces the following error:

ValueError('assignment destination is read-only')

I am raising this topic to understand how to overcome the problem or, if this is a misuse of Dask on my side, to find alternatives.

The main problem I aim to solve is: how can I properly write in place to the chunks received by a function mapped over a Dask array with map_blocks?

Or, in other words: can we modify a chunk in-place?

In this notebook, I show how to reproduce the error, following the same MCVE guidelines as on the xarray GitHub. For reference:

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Thank you for your help!

Initialization

import warnings

warnings.simplefilter("ignore")
from typing import Any

import dask
import dask.array as da
import numpy as np
import numpy.typing as npt
from dask.distributed import Client
dask.__version__
'2023.11.0'
client = Client(n_workers=4, threads_per_worker=4, memory_limit="16GiB")
print(client)
<Client: 'tcp://127.0.0.1:46231' processes=4 threads=16, memory=64.00 GiB>

Test preparation

shape = (4, 4)
chunks = (2, 2)
dtype = np.int32

Case A: np.zeros

The first array is created from a numpy array, with np.zeros.

dask_array_a = da.from_array(np.zeros(np.prod(shape), dtype=dtype).reshape(shape), chunks=chunks)

print(dask_array_a)
print(dask_array_a.compute())
dask.array<array, shape=(4, 4), dtype=int32, chunksize=(2, 2), chunktype=numpy.ndarray>
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]

Case B: da.zeros_like

The second array is created with the dask equivalent of zeros_like, using the first array as a template.

dask_array_b = da.zeros_like(dask_array_a)

print(dask_array_b)
print(dask_array_b.compute())
dask.array<zeros_like, shape=(4, 4), dtype=int32, chunksize=(2, 2), chunktype=numpy.ndarray>
[[0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]
 [0 0 0 0]]

Function to be mapped

Create a dummy function, to be mapped over the chunks of the dask arrays.
It updates the top-left pixel of the given chunk to 777.

Note that this does not cause the “pre-flight” execution of the function to fail:

Note that map_blocks will attempt to automatically determine the output array type by calling func on 0-d versions of the inputs.

Source: dask.array.map_blocks

def block_write_in_place(
    block: npt.NDArray[Any],
):
    block[0, 0] = 777
    return block

Test execution

Case A

lazy = dask_array_a.map_blocks(block_write_in_place)
result = lazy.compute()
print(result)
[[777   0 777   0]
 [  0   0   0   0]
 [777   0 777   0]
 [  0   0   0   0]]

The in-place write is done as expected, without any error.

Case B

lazy = dask_array_b.map_blocks(block_write_in_place)

try: 
    result = lazy.compute()
except ValueError as error:
    print(error)
assignment destination is read-only


2024-04-22 11:03:18,768 - distributed.worker - WARNING - Compute Failed
Key:       ('block_write_in_place-d50ef239ad34e97a5f2c6ce6add02a7a', 0, 1)
Function:  subgraph_callable-259a4044-a3c0-449e-86bc-2bc986a4
args:      ((2, 2))
kwargs:    {}
Exception: "ValueError('assignment destination is read-only')"

2024-04-22 11:03:18,768 - distributed.worker - WARNING - Compute Failed
Key:       ('block_write_in_place-d50ef239ad34e97a5f2c6ce6add02a7a', 0, 0)
Function:  subgraph_callable-259a4044-a3c0-449e-86bc-2bc986a4
args:      ((2, 2))
kwargs:    {}
Exception: "ValueError('assignment destination is read-only')"

2024-04-22 11:03:18,769 - distributed.worker - WARNING - Compute Failed
Key:       ('block_write_in_place-d50ef239ad34e97a5f2c6ce6add02a7a', 1, 0)
Function:  subgraph_callable-259a4044-a3c0-449e-86bc-2bc986a4
args:      ((2, 2))
kwargs:    {}
Exception: "ValueError('assignment destination is read-only')"

2024-04-22 11:03:18,769 - distributed.worker - WARNING - Compute Failed
Key:       ('block_write_in_place-d50ef239ad34e97a5f2c6ce6add02a7a', 1, 1)
Function:  subgraph_callable-259a4044-a3c0-449e-86bc-2bc986a4
args:      ((2, 2))
kwargs:    {}
Exception: "ValueError('assignment destination is read-only')"

Here, we can see the error: the received chunk is not writeable.
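The same error message can be reproduced with NumPy alone, which may help narrow down where the read-only flag comes from. The sketch below rests on an assumption that is not confirmed anywhere in this topic: that the chunks produced by da.zeros_like are backed by a broadcast view of a single scalar, similar to what np.broadcast_to returns. It only shows that such a view is read-only and raises the same ValueError.

```python
import numpy as np

# Assumption (not confirmed in this topic): a zeros chunk may be a broadcast
# view of one scalar, which NumPy marks as read-only.
view = np.broadcast_to(np.zeros((), dtype=np.int32), (2, 2))
is_writeable = bool(view.flags.writeable)

try:
    view[0, 0] = 777
    message = None
except ValueError as error:
    message = str(error)

# A copy owns its own memory and can be written to.
owned = view.copy()
owned[0, 0] = 777

print(is_writeable, message)
```

Checking block.flags.writeable inside the mapped function is a quick way to see which kind of chunk a given Dask array hands over.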

PS: I found a similar open issue from 4 years ago on Dask’s GitHub: np.asarray returns non-writable array #6520

Well, in some cases you can, and in others you cannot. Overall, I think it’s best to assume that Dask arrays are immutable and that blocks should be copied before being modified.

In any case, there is at least some lack of documentation here. Maybe you could weigh in on this GitHub issue, but I’ll also ping @jakirkham @fjetter.

Hello,

Thanks for your answer!

This is quite a limitation when updating existing NumPy-based Python code to use Dask, if the existing code performs a lot of in-place mutations on its arrays. For example, a single array that goes through many successive updates in a processing chain.

A workaround would be to copy the chunk inside the function passed to map_blocks and write to the copy, but this means allocating a new chunk for each map_blocks call, which does not seem to be the best solution. However, it is the only way to keep inputs immutable while reusing complex NumPy code, using Dask more as a “wrapper” around NumPy code applied to chunks (allowing code reuse).
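The copy-based workaround can be sketched as follows. This is a minimal illustration rather than a recommended pattern from the Dask maintainers; the function name block_write_on_copy is hypothetical, and the synchronous scheduler is used only so that the example runs without a cluster.

```python
import dask.array as da
import numpy as np


def block_write_on_copy(block):
    # Copy first, so the input chunk is never modified, then write to the copy.
    out = block.copy()
    out[0, 0] = 777
    return out


template = da.from_array(np.zeros((4, 4), dtype=np.int32), chunks=(2, 2))
zeros = da.zeros_like(template)

# dtype is passed explicitly so map_blocks does not need to infer it.
lazy = zeros.map_blocks(block_write_on_copy, dtype=zeros.dtype)
result = lazy.compute(scheduler="synchronous")
untouched = zeros.compute(scheduler="synchronous")

print(result)
```

The cost is one extra allocation per chunk and per map_blocks call. In exchange, the function behaves the same whether or not the incoming chunk is writeable.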

Maybe having some documentation exposing common pitfalls when trying to reuse existing Python code using Numpy to make it work with Dask would be helpful! Maybe in this documentation page related to Dask-supported assignments on arrays (I found it useful as it exposes a list of concrete examples): Assignment — Dask documentation

In my experience so far, running existing code “as is” with Dask arrays instead of NumPy ones works fine in 90% of cases, but the remaining 10% can be hard to work around, e.g. when trying to index a 2-D Dask array with another 2-D Dask array, or with more complex assignments (assigning a full 2-D array rather than a scalar). In some cases, the Python code itself can be rewritten and simplified to make it Dask-friendly, but not always. (Unfortunately I don’t have concrete examples right now; I will try to come up with better real-life examples in the future.)
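As one small illustration of such a rewrite, a masked in-place assignment like values[mask] = -1 can sometimes be expressed with da.where, which builds a new array instead of mutating the existing one. The arrays and the replacement value below are made up for the example, and this does not cover the harder cases mentioned above.

```python
import dask.array as da
import numpy as np

source = np.arange(16, dtype=np.int32).reshape(4, 4)
values = da.from_array(source, chunks=(2, 2))

# 2-D boolean mask with the same shape and chunks as values.
mask = values % 2 == 0

# Non-mutating equivalent of the NumPy idiom values[mask] = -1.
replaced = da.where(mask, -1, values)
result = replaced.compute(scheduler="synchronous")

print(result)
```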

Finally, it seems that the cases where the Dask arrays can be mutated (like when being built from a Numpy array) should not be relied on. Should this behaviour (described in this topic) be considered as a bug, as Dask arrays should be immutable?

I think this is the correct approach.

This is already a very good result! But yes, it can be hard in some cases to migrate from NumPy to Dask.

This is a very good question, and I think that, at least for some use cases, it should be answered somewhere in the Dask documentation, as you said. But I’m not an Array expert, so I would prefer to have others chime in here.

All input data of dask functions must be treated at all times as strictly immutable, no ifs, no buts.

When immutability is not enforceable (e.g. a list or dict input parameter) and you do modify your input, things will go horribly wrong if you need to re-run your task at a later time or if the same data is used as input to multiple functions. Even worse, things will go unpredictably wrong, as you may get the original value most times and the updated value on sporadic occasions, or vice versa.
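The order dependence described above can be illustrated with plain NumPy. This sketch does not involve the Dask scheduler at all. It only simulates two functions sharing one input and being run in either order, and the names mutate_in_place, total and run are hypothetical.

```python
import numpy as np


def mutate_in_place(block):
    # Modifies its input, which is what the advice above warns against.
    block[0, 0] = 777
    return block


def total(block):
    return int(block.sum())


def run(order):
    # Both functions receive the same underlying array.
    shared = np.zeros((2, 2), dtype=np.int32)
    observed = None
    for name in order:
        if name == "mutate":
            mutate_in_place(shared)
        else:
            observed = total(shared)
    return observed


total_first = run(["total", "mutate"])
mutate_first = run(["mutate", "total"])
print(total_first, mutate_first)
```

The value seen by total depends entirely on whether the mutating function happened to run first, and a scheduler is free to choose either order.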

TL;DR: Do not modify your data in place. Ever.
