What is the best approach to manage a large number of small arrays?

Hi friends,

I have a use case where I need to manage a large number of small arrays (about 5k arrays, each no larger than 1000 * 100), and each array belongs to a specific category. I think there are two approaches:

  1. Create 5k Dask arrays. But I think an array size of 1000 * 100 is too small and will hurt Dask’s performance.
  2. Concatenate all arrays into one “big” Dask array, and let Dask pick a “good” chunk size.

For the second approach, I might have to append a column of category IDs to each array before concatenating, so that the original arrays can be recovered from the “big” array. For example, the “big” array X might look like:

x x x 0
x x x 0
y y y 1
y y y 1
y y y 1
z z z 2
z z z 2

In order to get the second sub-array (the one with ys), I would use X[X[:, 3] == 1]. But I think this is not efficient enough, because I only need to find the first and last occurrence of 1, while X[X[:, 3] == 1] compares every value in the 4th column against 1.
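(For reference, here is a minimal sketch of what I mean. Since the sub-arrays are stored contiguously, the category column could in principle be replaced by a list of row offsets and a plain slice; the shapes and the `get_category` helper below are only illustrative.)

```python
import numpy as np
import dask.array as da

# Hypothetical small arrays; in practice there would be ~5k of them.
small_arrays = [np.random.random((n_rows, 3)) for n_rows in (2, 3, 2)]

# Approach 2: concatenate everything into one big Dask array.
X = da.concatenate([da.from_array(a, chunks=a.shape) for a in small_arrays])

# Because the sub-arrays are contiguous, keep the row offset of each
# category instead of scanning a category column with a boolean mask.
row_counts = [a.shape[0] for a in small_arrays]
offsets = np.concatenate([[0], np.cumsum(row_counts)])

def get_category(X, offsets, cat_id):
    """Return the rows belonging to one category by plain slicing."""
    return X[offsets[cat_id]:offsets[cat_id + 1]]

sub = get_category(X, offsets, 1)   # the second block, no column scan needed
print(sub.compute())
```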

So, do you have any ideas for my use case? What is the best approach here? Thanks a lot.

Hi @Arsenal591,

Could you be more precise about the shape of your small arrays? Does each one have its own specific shape?
Also, could you clarify whether the computations you apply to each array are independent (i.e. an embarrassingly parallel problem)?

You can have non-uniform blocks in a Dask array (see Chunks — Dask documentation), and maybe also irregular ones? See the sketch below.
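For instance, if each of your small arrays becomes exactly one chunk of the concatenated array, you can get any of them back by block index rather than by masking. A minimal sketch, with made-up shapes:

```python
import numpy as np
import dask.array as da

# Hypothetical small arrays of different lengths (assumed for illustration).
small_arrays = [np.random.random((n, 100)) for n in (1000, 800, 950)]

# Concatenating single-chunk arrays gives one Dask array whose chunk
# boundaries line up with the original arrays (irregular chunks along axis 0).
X = da.concatenate([da.from_array(a, chunks=a.shape) for a in small_arrays])
print(X.chunks)          # ((1000, 800, 950), (100,))

# Each original array is then exactly one block, retrievable by index
# without any boolean masking or extra category column.
second = X.blocks[1]     # the second original array
print(second.shape)      # (800, 100)
```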
If the computation you have to perform is embarrassingly parallel, then you might just want to use Delayed on plain NumPy arrays rather than building a Dask array.
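Something along these lines, assuming the per-array computation is independent (`process` and the shapes here are placeholders for your real workload):

```python
import numpy as np
import dask

# Hypothetical per-array computation; replace with the real one.
def process(arr):
    return arr.mean(axis=0)

# Assumed list of small NumPy arrays (~5k in your case).
small_arrays = [np.random.random((1000, 100)) for _ in range(10)]

# Wrap each independent computation in Delayed and run them all in parallel.
tasks = [dask.delayed(process)(a) for a in small_arrays]
results = dask.compute(*tasks)
```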