How to get the original i,j,k location in blockwise operation

gavargas22 · February 17, 2022, 8:08pm

If I have an operation like dask_array_object.blocks and iterate over the blocks resulting from that, how can I get the original i,j,k location of each part inside of the function that is executing in a blockwise fashion?

Also, I noticed that there is an option of using the function dask.array.blockwise() for doing blockwise operations, same question. How can you know the original i,j,k location of the part in the dask array inside of the function executing in a blockwise fashion?

pavithraes · February 18, 2022, 12:53pm

@gavargas22 Welcome to Discourse! I think you can check out Dask Array’s map_blocks, it includes block_id and block_info keyword arguments that store the chunk location and more information respectively:

import dask.array as da

x = da.random.randint(100, size=(10,10), chunks=(5,5))

def func(x, block_id=None, block_info=None):
    print ("block_id = ", block_id)
    print ("block_info = ", block_info)
    return x+1

da.map_blocks(func, x).compute()
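To make the i,j,k location concrete, here is a minimal sketch of reading it out of these arguments. It assumes that `block_info[0]["array-location"]` holds one `(start, stop)` pair per axis for the first input array, as described in the `map_blocks` documentation. The function name `locate` and the explicit `meta` argument (used here only to skip output-type inference on dummy data) are choices made for this illustration.

```python
import numpy as np
import dask.array as da

x = da.ones((10, 10), chunks=(5, 5))

def locate(block, block_id=None, block_info=None):
    # block_id is the chunk index of this block, e.g. (1, 0).
    # block_info[0]["array-location"] lists (start, stop) per axis
    # for the first input array.
    starts = [start for start, stop in block_info[0]["array-location"]]
    # Fill the block with the row offset at which it starts in the full array
    return np.full(block.shape, starts[0], dtype=block.dtype)

result = da.map_blocks(
    locate,
    x,
    dtype=x.dtype,
    meta=np.array((), dtype=x.dtype),  # assumption: skip meta inference
).compute()
```

Each block of `result` is filled with the row index at which that block starts in the full array, so the offsets are available inside the function without any extra bookkeeping.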

gavargas22 · February 18, 2022, 3:59pm

Thank you for your response!

That’s a good solution. I tried it, but the function that I am executing with map_blocks writes to a file on disk in parallel, and your example returns x (the block), so when the whole operation completes I get a full-sized numpy array.

Is there a way to return something that is empty and does not occupy that space? If so, map_blocks would be perfect for me.

My data is a 300 GB dask array, and I just want to take the values of each block and write them into a file.

I am attempting something like this:

import dask.array as da

x = da.random.randint(100, size=(2000,2000,2000))

def func(x, block_id=None, block_info=None):
    # Grab the values of the 3D cube from Zarr disk store
    block_data = x.compute()
    # Function that writes the actual values to disk
    write_value_to_binary(block_data, "./file/datafile.bin")
    # Attempt to release the memory?
    x.close()

    return x

da.map_blocks(func, x).compute()

scharlottej13 · February 18, 2022, 11:49pm

Hi @gavargas22! In your snippet you mention you’re reading the array from Zarr, in which case you can use dask.array.from_zarr and dask.array.to_zarr, without necessarily needing to call compute. There are more details here, but you can do something like:

import dask.array as da

# save zarr file
x = da.random.randint(100, size=(10,10), chunks=(5,5))
x.to_zarr('test.zarr')
# read it in
y = da.from_zarr('test.zarr')
# do some stuff
z = y[::2, 5:].mean(axis=1)
# save result
z.to_zarr('result.zarr')

scharlottej13 · February 21, 2022, 8:58pm

@gavargas22 it looks like you also posted this question to Stack Overflow, where there is another answer. Do any of these address your question?

gavargas22 · February 21, 2022, 9:17pm

Thank you for your responses. It partially answers my question.

It seems that I can return the smallest array I can on each iteration of map_blocks and solve my problem.

But I also have another question: when you use .blocks.ravel(), how do you reconstruct the original shape and block arrangement in another numpy-like object?

I want to take each block from .blocks.ravel(), do some computation, and then write to a binary file at the specific i,j,k locations where the block is supposed to be.

Essentially, I am writing a binary file in a piecewise fashion.
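One way this piecewise write could look is sketched below, combining the "return the smallest array" idea with the block location from `block_info`. It is an illustration under assumptions, not a tested recipe for a 300 GB array: the file name `datafile.bin`, the helper `write_block`, and the use of `np.memmap` for the output file are all choices made here, and every worker is assumed to see the same file system.

```python
import numpy as np
import dask.array as da

source = np.arange(8 * 6 * 4).reshape((8, 6, 4))
x = da.from_array(source, chunks=(4, 3, 2))

path = "datafile.bin"  # hypothetical output file
full_shape = x.shape

# Pre-allocate the file once so that each task can open it in "r+" mode
np.memmap(path, dtype=x.dtype, mode="w+", shape=full_shape).flush()

def write_block(block, block_info=None):
    # (start, stop) index range of this block along each axis of the full array
    location = block_info[0]["array-location"]
    region = tuple(slice(start, stop) for start, stop in location)
    out = np.memmap(path, dtype=block.dtype, mode="r+", shape=full_shape)
    out[region] = block  # write only this block's region of the file
    out.flush()
    # Return a one-element placeholder instead of the block itself
    return np.zeros((1,) * block.ndim, dtype="u1")

placeholder = da.map_blocks(
    write_block,
    x,
    chunks=(1,) * x.ndim,           # every output block has a single element
    dtype="u1",
    meta=np.array((), dtype="u1"),  # assumption: skip meta inference
).compute()
```

The computed result has one byte per block rather than one value per element, so the full array is never gathered in memory. The blocks write to disjoint regions of the file, which is why no locking is used in this sketch.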

ParticularMiner · February 22, 2022, 7:54am

Hi @gavargas22

It turns out numpy is quite strict about what can and cannot be converted into an array. Essentially, an object must be “array-like” to be converted. .blocks.ravel() is not, since its items generally have different shapes.

You could however spoof numpy into thinking .blocks.ravel() is array-like by wrapping each of its items with an arbitrary non-array-like object:

import numpy as np
import dask.array as da


class wrapped():
    def __init__(self, block):
        self.view = block

# an example dask array `x` of arbitrary shape and chunks:
x = da.from_array(np.arange(4*3*5).reshape((4, 3, 5)), chunks=2)
x_block_list = x.blocks.ravel()
blocks = np.array([wrapped(block) for block in x_block_list])

which can then be reshaped into the desired structure:

block_array = blocks.reshape(x.numblocks)

whose elements can be referenced with block indices. For example,

block_array[1, 0, 2].view.compute()

But I find the above approach somewhat unwieldy, since the easiest way to directly reference a block of a dask array x (if that’s what you really want) is to do:

x.dask[(x.name, 1, 0, 2)]
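As a possible alternative that stays within the public array API, the blocks can also be addressed directly through `x.blocks` with block indices, and `np.ndindex(*x.numblocks)` enumerates every i,j,k combination. The sketch below reuses the example array from above; the `shapes` dictionary is only there for illustration.

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(4 * 3 * 5).reshape((4, 3, 5)), chunks=2)

# x.numblocks is the number of blocks along each axis, here (2, 2, 3)
shapes = {}
for idx in np.ndindex(*x.numblocks):
    block = x.blocks[idx]      # a lazy dask array holding just that block
    shapes[idx] = block.shape  # record the shape of each block by its index

# Compute a single block by its i,j,k block index
one_block = x.blocks[1, 0, 2].compute()
```

Because `idx` is the block index itself, no reshaping of `.blocks.ravel()` is needed to recover the original arrangement.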
