Instantiate a chunk inside a mapped function in order to construct a Dask array from scratch?

Context

Hello,

I am creating this topic about building a Dask array from scratch. I read the documentation article
Array Creation and saw an example in the section starting with Often it is substantially faster to use da.map_blocks rather than da.stack. Note that the example
is unrelated to this topic at first glance, but it shows an interesting use of Dask.

Indeed, the example constructs a dask array by calling map_blocks directly from the dask.array module (da.map_blocks) rather than as a method of an existing dask array.

An advantage of this method is that, since the chunk is created inside the function instead of received
as an immutable argument (as when map_blocks is called on an existing dask array),
the chunk is writeable and we have full control over it. This can be useful for functions that take references
to empty numpy arrays that are meant to be filled (written in place).

The code below is adapted from the code snippet in the Dask documentation on Array Creation, below the sentence Often it is substantially faster to use da.map_blocks rather than da.stack:

from typing import Any
import numpy as np
import dask.array as da
import numpy.typing as npt


def read_one_image(block_id: tuple[int, int]) -> npt.NDArray[Any]:
    new_chunk_numpy_array = (block_id[0] * 10 + block_id[1]) * np.ones((2, 2), dtype=np.uint8)
    return new_chunk_numpy_array


dask_array = da.map_blocks(
    read_one_image,
    dtype=np.uint8,
    chunks=((2, 2), (2, 2)),
)

print(dask_array)
# dask.array<read_one_image, shape=(4, 4), dtype=uint8, chunksize=(2, 2), chunktype=numpy.ndarray>
numpy_array = dask_array.compute()
print(numpy_array)
# [[ 0  0  1  1]
#  [ 0  0  1  1]
#  [10 10 11 11]
#  [10 10 11 11]]

Here are three questions:

(1) Is it good practice to instantiate such arrays from scratch? My intuition is that it is the only way in some cases, e.g. when reading
from multi-file datasets (like chunked Zarr arrays). If you have some examples of code instantiating such dask arrays
from scratch, that would be useful!
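For illustration, here is the kind of from-scratch construction I have in mind, with hypothetical .npy tiles standing in for a real multi-file dataset (the filenames and tile contents are made up for the example):

```python
import os
import tempfile

import numpy as np
import dask.array as da

# Hypothetical setup: four 2x2 tiles saved as separate .npy files,
# standing in for a real multi-file dataset.
tmpdir = tempfile.mkdtemp()
for i in range(2):
    for j in range(2):
        np.save(os.path.join(tmpdir, f"tile_{i}_{j}.npy"),
                np.full((2, 2), i * 10 + j, dtype=np.uint8))


def read_one_tile(block_id=None):
    # np.load allocates a fresh, writable array: the chunk is created
    # inside the function, so we have full control over it here.
    i, j = block_id
    return np.load(os.path.join(tmpdir, f"tile_{i}_{j}.npy"))


# No input dask array: da.map_blocks builds one from the block grid.
lazy = da.map_blocks(read_one_tile, dtype=np.uint8, chunks=((2, 2), (2, 2)))
result = lazy.compute()
```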

(2) If (1) is OK, I assume such mutability can only be exploited at the start of the processing chain, i.e. in the initial
functions creating the dask array, but not afterwards. We get one shot, at array instantiation.

(3) The only other way to mutate a chunk of a dask array "in place" would be not to mutate it at all, but instead to instantiate
a copy of it with .copy() (if (1) is OK) at the start of the mapped function, apply the computation to the copy, and
return it in place of the original chunk. This leads to higher memory usage (the cost of immutability), and should be
a last resort when dask functions are not enough?
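As a sketch of the .copy() pattern I mean (the diagonal-zeroing function is just a made-up example of an in-place operation):

```python
import numpy as np
import dask.array as da


def zero_out_diagonal(chunk):
    # The received chunk must be treated as read-only, so work on a
    # private copy; this costs one extra chunk of memory.
    out = chunk.copy()
    np.fill_diagonal(out, 0)  # in-place write, but only on our copy
    return out


x = da.ones((4, 4), chunks=(2, 2), dtype=np.uint8)
# Each 2x2 block gets its own diagonal zeroed out.
y = x.map_blocks(zero_out_diagonal, dtype=np.uint8).compute()
```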

Concrete real-life problem: adapting legacy Python code using NumPy

One use case a Dask user might face is adapting an existing code base written in Python that makes extensive use
of in-place assignments into mutable NumPy arrays. Such logic breaks very quickly when the original NumPy
arrays are replaced by Dask arrays. Some guidance on how to deal with such situations could significantly lower
the barrier to entry to Dask. I looked for such a page in the documentation, but it does not exist (yet?).
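To illustrate, here is a minimal hypothetical sketch of such legacy code (a routine that fills a caller-provided array in place), adapted by allocating each chunk inside the mapped function so the in-place logic keeps working unchanged:

```python
import numpy as np
import dask.array as da


def legacy_fill(out):
    # Legacy-style NumPy routine (hypothetical): writes into a
    # caller-provided array in place instead of returning a new one.
    out[...] = np.arange(out.size, dtype=out.dtype).reshape(out.shape)


def make_block(block_id=None):
    # Allocate the chunk inside the mapped function, so the legacy
    # in-place logic can mutate it freely.
    chunk = np.empty((2, 2), dtype=np.int64)
    legacy_fill(chunk)
    # Hypothetical per-block offset, just to make blocks distinguishable.
    return chunk + block_id[0] * 100 + block_id[1] * 10


arr = da.map_blocks(make_block, dtype=np.int64, chunks=((2, 2), (2, 2)))
res = arr.compute()
```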

Links

dask.array.Array.copy

Copy array. This is a no-op for dask.arrays, which are immutable


Why is it undesirable for delayed functions to mutate inputs?

General justification for having immutable inputs (stated in the doc Don’t mutate inputs for Delayed, but I assume this also holds for Dask collections like Dask arrays)


Allowing setitem-like operation on dask array #2000

and

Assignment

This documentation page shows the various forms of assignment supported by dask arrays. It is related to the topic
as it lists the only ways to “update” a Dask array (more precisely, adding another operation on top of the dask graph).
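For example, such graph-level “assignment” looks like ordinary setitem syntax, but each statement adds an operation to the task graph instead of mutating any chunk (boolean-mask and indexed assignment are among the forms that page lists):

```python
import dask.array as da

x = da.arange(8, chunks=4)
# Each assignment below adds an operation on top of the task graph;
# no existing chunk is mutated in place.
x[x > 5] = 0   # boolean-mask assignment
x[0] = 42      # basic indexed assignment
res = x.compute()
```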

I initially saw map_blocks as a way to gain full control over the received chunk, which was a misunderstanding:
the received chunks are immutable, as far as I now understand.


Hello again,

Well it’s definitely not a bad practice, but it depends on your use case!

I’m not sure I understand your point here. If you instantiate from Zarr, why not use from_zarr?

Probably, but as mentioned in your other topic, things are still unclear to me.

With that, you can be sure you won’t run into any trouble.