Hi friends,
I have a use case where I need to manage a number of small arrays (~5k arrays, each no larger than 1000 * 100), and each array belongs to a specific category. I can think of two approaches:
- Create 5k dask arrays. But I think an array size of 1000 * 100 is too small, and having so many tiny arrays will hurt dask’s performance.
- Concatenate all arrays into one “big” dask array, and let dask select a “good” chunk size.
For the second approach, I might have to append a column of category ids to each array before concatenating, so that the original arrays can be recovered from the “big” array. For example, the “big” array X might look like:
x x x 0
x x x 0
y y y 1
y y y 1
y y y 1
z z z 2
z z z 2
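To make the layout above concrete, here is a minimal sketch of building such a “big” array with dask (the array sizes and names here are hypothetical, just to illustrate the category-id column):

```python
import numpy as np
import dask.array as da

# Hypothetical small arrays; in practice there would be ~5k of them,
# each no larger than 1000 x 100.
small_arrays = [np.random.rand(1000, 3) for _ in range(5)]

# Append a category-id column to each array before concatenating,
# so each sub-array can be identified later.
labeled = [
    np.hstack([arr, np.full((arr.shape[0], 1), cat_id)])
    for cat_id, arr in enumerate(small_arrays)
]

# One "big" dask array; let dask pick the chunk sizes.
X = da.concatenate([da.from_array(a, chunks="auto") for a in labeled])

# Recovering category 1 with a boolean mask scans the whole id column:
sub = X[X[:, -1] == 1]
```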
In order to get the second sub-array (the one with ys), I would use X[X[:, 3] == 1]. But I think this is not efficient enough: since the rows are sorted by category, I only need to find the first and last occurrence of 1, whereas X[X[:, 3] == 1] will compare every value in the 4th column against 1.
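One alternative (a sketch, assuming the arrays are concatenated in category order and their lengths are known at build time) is to record the row offset of each category and slice directly, which avoids both the extra id column and the full-column scan:

```python
import numpy as np
import dask.array as da

# Hypothetical example: three small arrays, one per category.
small_arrays = [np.ones((4, 3)) * k for k in range(3)]

# Record cumulative row offsets while concatenating, so each
# category maps to a contiguous [start, stop) row range.
lengths = [a.shape[0] for a in small_arrays]
offsets = np.concatenate([[0], np.cumsum(lengths)])

X = da.concatenate([da.from_array(a, chunks=(2, 3)) for a in small_arrays])

def get_category(X, offsets, cat_id):
    """Slice out one category's rows without scanning any id column."""
    start, stop = offsets[cat_id], offsets[cat_id + 1]
    return X[start:stop]

sub = get_category(X, offsets, 1)
```

Plain slicing like this only touches the chunks that overlap the requested row range, rather than materializing a comparison over the whole column.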
So do you have any ideas about my use case? What is the best approach here? Thanks a lot.