What is the best approach to manage a large number of small arrays?

Hi friends,

I have a use case where I need to manage a large number of small arrays (about 5k arrays, each no larger than 1000 * 100), and each array belongs to a specific category. I think there are two approaches:

  1. Create 5k Dask arrays. But I think an array size of 1000 * 100 is too small and will hurt Dask’s performance.
  2. Concatenate all arrays into one “big” Dask array, and let Dask pick a “good” chunk size.

For the second approach, I might have to append a column of category IDs to each array before concatenating, so that the original arrays can be recovered from the “big” array. For example, the “big” array X might look like:

x x x 0
x x x 0
y y y 1
y y y 1
y y y 1
z z z 2
z z z 2

In order to get the second sub-array (the one with ys), I would use X[X[:, 3] == 1]. But I think this is not efficient enough, because I only need to find the first and last occurrence of 1, while X[X[:, 3] == 1] compares every value in the 4th column against 1.
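(For reference, here is a minimal sketch of what I mean. Since the sub-arrays are stored contiguously, the category column could in principle be replaced by a list of row offsets and a plain slice; the shapes and the `get_category` helper below are only illustrative.)

```python
import numpy as np
import dask.array as da

# Hypothetical small arrays; in practice there would be ~5k of them.
small_arrays = [np.random.random((n_rows, 3)) for n_rows in (2, 3, 2)]

# Approach 2: concatenate everything into one big Dask array.
X = da.concatenate([da.from_array(a, chunks=a.shape) for a in small_arrays])

# Because the sub-arrays are contiguous, keep the row offset of each
# category instead of scanning a category column with a boolean mask.
row_counts = [a.shape[0] for a in small_arrays]
offsets = np.concatenate([[0], np.cumsum(row_counts)])

def get_category(X, offsets, cat_id):
    """Return the rows belonging to one category by plain slicing."""
    return X[offsets[cat_id]:offsets[cat_id + 1]]

sub = get_category(X, offsets, 1)   # the second block, no column scan needed
print(sub.compute())
```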

So, do you have any ideas for my use case? What is the best approach here? Thanks a lot.

Hi @Arsenal591,

Could you be more precise about the shape of your small arrays? Does each one have its own specific shape?
Also, could you clarify whether the computations you apply to each array are independent (i.e. an embarrassingly parallel problem)?

You can have non-uniform blocks in a Dask array (see Chunks — Dask documentation), and maybe also irregular ones? See the sketch below.
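For instance, if each of your small arrays becomes exactly one chunk of the concatenated array, you can get any of them back by block index rather than by masking. A minimal sketch, with made-up shapes:

```python
import numpy as np
import dask.array as da

# Hypothetical small arrays of different lengths (assumed for illustration).
small_arrays = [np.random.random((n, 100)) for n in (1000, 800, 950)]

# Concatenating single-chunk arrays gives one Dask array whose chunk
# boundaries line up with the original arrays (irregular chunks along axis 0).
X = da.concatenate([da.from_array(a, chunks=a.shape) for a in small_arrays])
print(X.chunks)          # ((1000, 800, 950), (100,))

# Each original array is then exactly one block, retrievable by index
# without any boolean masking or extra category column.
second = X.blocks[1]     # the second original array
print(second.shape)      # (800, 100)
```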
If the computation you have to perform is embarrassingly parallel, then you might just want to use Delayed on plain NumPy arrays rather than building a Dask array.
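Something along these lines, assuming the per-array computation is independent (`process` and the shapes here are placeholders for your real workload):

```python
import numpy as np
import dask

# Hypothetical per-array computation; replace with the real one.
def process(arr):
    return arr.mean(axis=0)

# Assumed list of small NumPy arrays (~5k in your case).
small_arrays = [np.random.random((1000, 100)) for _ in range(10)]

# Wrap each independent computation in Delayed and run them all in parallel.
tasks = [dask.delayed(process)(a) for a in small_arrays]
results = dask.compute(*tasks)
```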