I’ve had this question for a long time, and I’m still not sure about the answer. When indexing a dask array, is it better to do it with a numpy array or with a dask array? I have code that determines which data are valid and creates an array of booleans: True for valid data and False otherwise. The problem is that if I index the data this way, I get chunks of shape (nan,). Is there a way to avoid this? I know one fix is to call compute_chunk_sizes(), but I guess that takes a lot of computing time. Is there another way to avoid it, or a way to do it efficiently?
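import dask.array as da
a = da.random.random((10,))
# high is exclusive, so high=2 is needed to get both 0 and 1; cast to a boolean mask
c = da.random.randint(low=0, high=2, size=(10,)).astype(bool)
a[c]
dask.array<getitem, shape=(nan,), dtype=float64, chunksize=(nan,), chunktype=numpy.ndarray>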
Welcome @miguelcarcamov !
Applying a “mask” (the formal term for the array of booleans you are using as an index) is one of those exceptional operations where, even when you know the structure (that is, the chunk shapes) of both the array and the mask, it is still impossible to determine the structure of the result without first computing the mask completely. So any masking operation performed without full knowledge of the mask invariably leads to a result whose structure is unknown [that is, with (numpy.nan,) chunks] prior to its computation.
So you are forced either to
- first compute the mask (converting it to a numpy array of booleans) before applying it to the array, or to
- call .compute_chunk_sizes() after applying the uncomputed mask.
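As a rough illustration of the two options above, here is a minimal sketch. The array, the chunking, and the mask condition are made up for the example (a small deterministic array is used so the result is easy to check); they are not taken from the original code.

```python
import numpy as np
import dask.array as da

# Illustrative data (an assumption for this sketch): 0..9 in two chunks of 5.
a = da.arange(10, chunks=5)
mask = (a % 3 == 0)  # lazy boolean dask array

# Indexing with the lazy mask: the result has unknown (nan) shape and chunks.
unknown = a[mask]

# Option 1: compute the mask to a numpy boolean array first, then index.
mask_np = mask.compute()
known = a[mask_np]  # shape is known immediately

# Option 2: index with the lazy mask, then resolve the chunk sizes.
resolved = a[mask]
resolved.compute_chunk_sizes()  # computes the mask and updates the array in place
```

In both options the mask ends up being computed once before the final result is; the difference is whether it is held in memory as a single numpy array (option 1) or only reduced chunk by chunk to sizes (option 2).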
As to the question of which method is better: that, I guess, is a matter of convenience. For instance, if computing the mask produces an array that is too large to fit into RAM, then obviously it is better not to compute it, and your only choice is the latter method. The question of which is faster, on the other hand, is best answered by actually testing both methods, which I encourage you to investigate yourself if you want. And if you do choose to benchmark these methods, I’d be grateful if you would report your results here for other readers to be informed.
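If it helps as a starting point, a comparison along those lines might look like the sketch below. The helper names (time_it, mask_first, sizes_after), the array size, and the mask condition are all hypothetical choices for illustration; real timings will depend on your data, chunking, and scheduler, so no numbers are claimed here.

```python
import time
import dask.array as da

# Hypothetical test data for the sketch: adjust size and chunks to your case.
x = da.arange(200_000, chunks=50_000)
mask = (x % 2 == 0)

def time_it(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def mask_first():
    # Compute the mask to numpy, index with it, then compute the result.
    return x[mask.compute()].compute()

def sizes_after():
    # Index with the lazy mask, resolve chunk sizes, then compute the result.
    y = x[mask]
    y.compute_chunk_sizes()
    return y.compute()

r1, t1 = time_it(mask_first)
r2, t2 = time_it(sizes_after)
print(f"mask first:  {t1:.4f} s")
print(f"sizes after: {t2:.4f} s")
```

A single run like this is noisy; repeating each variant several times and comparing medians would give a fairer picture.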
See this post for some crude benchmarks in a slightly different context.