If I have an operation like dask_array_object.blocks
and iterate over the blocks resulting from that, how can I get the original i,j,k
location of each part inside of the function that is executing in a blockwise fashion?
Also, I noticed that there is an option of using the function dask.array.blockwise()
for doing blockwise operations, same question. How can you know the original i,j,k
location of the part in the dask array inside of the function executing in a blockwise fashion?
@gavargas22 Welcome to Discourse! I think you can check out Dask Array’s map_blocks
, it includes block_id
and block_info
keyword arguments that store the chunk location and more information respectively:
import dask.array as da
x = da.random.randint(100, size=(10,10), chunks=(5,5))
def func(x, block_id=None, block_info=None):
print ("block_id = ", block_id)
print ("block_info = ", block_info)
return x+1
da.map_blocks(func, x).compute()
2 Likes
Thank you for your response!
That’s a good solution, I tried this, but the function that I am executing with map_blocks
is a function that writes to a file on disk in parallel and in your example I return x (which is the block); so when the whole operation completes, I get a full sized numpy array.
Is there a way to return something that is empty, and does not occupy the space? If so, map_blocks would be perfect for me
My data is a 300 GB dask array, and I just want to take the values of each block and write them into a file
I am atempting something like this:
import dask.array as da
x = da.random.randint(100, size=(2000,2000,2000)))
def func(x, block_id=None, block_info=None):
# Grab the values of the 3D cube from Zarr disk store
block_data = x.compute()
# Function that writes the actual values to disk
write_value_to_binary(block_data, "./file/datafile.bin")
# Attempt to release the memory?
x.close()
return x
da.map_blocks(func, x).compute()
Hi @gavargas22! In your snippet you mention you’re reading the array from Zarr, in which case you can use dask.array.from_zarr
and dask.array.to_zarr
, without necessarily needing to call compute
. There are more details here, but you can do something like:
import dask.array as da
# save zarr file
x = da.random.randint(100, size=(10,10), chunks=(5,5))
x.to_zarr('test.zarr')
# read it in
y = da.from_zarr('test.zarr')
# do some stuff
z = y[::2, 5000:].mean(axis=1)
# save result
z.to_zarr('result.zarr')
1 Like
@gavargas22 it looks like you also posted this question to stack overflow, where there is another answer– do any of these address your question?
Thank you for your responses. It partially answers my question.
It seems that I can return the smallest array I can on each iteration of map_blocks
and solve my problem.
But I also have the question of: When you use .blocks.ravel()
How do you reconstruct the original shape and block arrangement into another numpy-like object.
I want to take each block from .blocks.ravel()
do some computation and then write to a binary file at the specific i,j,k
locations where the block is supposed to be.
Essentially, I am writing a binary file in a piecewise fashion.
Hi @gavargas22
It turns out numpy
is quite strict about what one can and cannot be converted into an array. Essentially an object must be “array-like” to be able to get converted. .blocks.ravel()
is not, since its items have different shapes in general.
You could however spoof numpy
into thinking .blocks.ravel()
is array-like by wrapping each of its items with an arbitrary non-array-like object:
import numpy as np
import dask.array as da
class wrapped():
def __init__(self, block):
self.view = block
# an example dask array `x` of arbitrary shape and chunks:
x = da.from_array(np.arange(4*3*5).reshape((4, 3, 5)), chunks=2)
x_block_list = x.blocks.ravel()
blocks = np.array([wrapped(block) for block in x_block_list])
which can then be reshaped into the desired structure:
block_array = blocks.reshape(x.numblocks)
whose elements can be referenced with block indices. For example,
block_array[1, 0, 2].view.compute()
But I find the above approach somewhat unwieldy, since the easiest way to directly reference a block of a dask
array x
(if that’s what you really want) is to do:
x.dask[(x.name, 1, 0, 2)]
3 Likes