Indexing a dask array with a boolean array

miguelcarcamov · May 4, 2022, 8:47pm

Hi everyone,

I’ve had this question for a long time, and I’m not quite sure about the answer. When indexing a dask array, is it better to do it with a numpy array or a dask array? My code determines which data are valid and creates a boolean array: True for valid entries and False otherwise. The problem is that if I index the data this way, I get chunks of shape (nan,). Is there a way to avoid this? I know one fix is to call compute_chunk_sizes(), but I guess that takes a lot of computing time. Is there another way to avoid it, or a way to do it efficiently?


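import dask.array as da

a = da.random.random((10,))
# `high` is exclusive, so high=2 is needed to draw both 0 and 1
# (high=1 would produce an all-False mask).
c = da.random.randint(low=0, high=2, size=(10,))
a[c.astype(bool)]
dask.array<getitem, shape=(nan,), dtype=float64, chunksize=(nan,), chunktype=numpy.ndarray>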

Cheers

ParticularMiner · May 5, 2022, 9:17am

Welcome @miguelcarcamov !

Good question.

Applying a “mask” (which is the formal term describing the array of booleans you are referring to as an index) is one of those exceptional operations where even when you know the structure (that is, the chunk-shapes) of the array and mask, it is still impossible to determine the structure of the result without first computing the mask completely. So any mask-operation without the full knowledge of the mask invariably leads to a result whose structure is unknown [that is, with (numpy.nan,) chunks] prior to its computation.
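To illustrate the point above, here is a minimal sketch (the array `x` and the two thresholds are made up for the example). Two masks with identical shape and chunking select different numbers of elements, so the length of the result cannot be known until the mask values themselves are computed:

```python
import numpy as np
import dask.array as da

# Ten values in two chunks of five; the structure of `x` is fully known.
x = da.from_array(np.arange(10.0), chunks=5)

# Two lazy masks with the same shape and chunking but different contents.
mask_few = x > 7    # True for 2 elements (8 and 9)
mask_many = x > 1   # True for 8 elements (2 through 9)

few = x[mask_few]
many = x[mask_many]

# Before computing, both results report an unknown length.
print(few.shape, many.shape)    # (nan,) (nan,)

# Only after computing do the lengths appear, and they differ.
print(len(few.compute()), len(many.compute()))    # 2 8
```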

So you are forced to either

1. first compute the mask (converting it to a numpy array of booleans) before applying it to the array, or
2. use .compute_chunk_sizes() after applying the uncomputed mask.
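The two options can be sketched as follows. This is only an illustration on a small, made-up array (`x` and the even-number mask are assumptions for the example), not a recommendation of one option over the other:

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(10.0), chunks=5)
mask = x % 2 == 0            # lazy dask mask; nothing is computed yet

# Option 1: compute the mask into a NumPy boolean array, then index with it.
mask_np = mask.compute()
r1 = x[mask_np]
print(r1.shape)              # the length is known: 5 elements

# Option 2: index with the lazy mask, then resolve the chunk sizes.
r2 = x[mask]
print(r2.shape)              # unknown length, reported as nan
r2.compute_chunk_sizes()     # evaluates the mask and updates r2 in place
print(r2.shape, r2.chunks)   # the length and chunk sizes are now known
```

Both options have to evaluate the mask once. They differ in where the result of that evaluation lives: option 1 holds the whole mask in memory as a NumPy array, while option 2 only brings back the per-chunk sizes.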

As to the question of which method is better: that, I guess, is a matter of convenience. For instance, if computing the mask produces an array that is too large to fit into RAM, then obviously it is better not to compute it, and your only choice is the latter method. The question of which is faster, on the other hand, is best answered by actually testing both methods, which I encourage you to investigate yourself if you want. And if you do choose to benchmark these methods, I’ll be grateful if you would report your results here for other readers to be informed.
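A rough way to run such a test might look like the sketch below. The array size, chunk size, threshold, and the helper names `mask_first` and `sizes_after` are all hypothetical, and the timings will depend heavily on the real data, the scheduler, and how expensive the mask is to compute:

```python
import time
import numpy as np
import dask.array as da

# Hypothetical sizes; adjust them to resemble the real workload.
n, chunk = 2_000_000, 200_000
rng = np.random.default_rng(0)
x = da.from_array(rng.random(n), chunks=chunk)
mask = x > 0.5               # lazy mask

def mask_first():
    # Compute the mask to NumPy first, then index with it.
    return x[mask.compute()]

def sizes_after():
    # Index with the lazy mask, then resolve the unknown chunk sizes.
    return x[mask].compute_chunk_sizes()

timings = {}
for fn in (mask_first, sizes_after):
    start = time.perf_counter()
    result = fn()
    timings[fn.__name__] = time.perf_counter() - start
    print(f"{fn.__name__}: {timings[fn.__name__]:.3f} s, shape={result.shape}")
```

A single run like this is noisy; repeating each measurement several times and using data of realistic size would give more trustworthy numbers.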

Cheers!

ParticularMiner · May 19, 2022, 6:48am

@miguelcarcamov

See this post for some crude benchmarks in a slightly different context.
