Efficient dask array repeat without further rechunking

noah822 · August 10, 2023, 7:15am

Is there an efficient way to build a new dask array by repeating a base array multiple times, such that each row of the new array is its own chunk? Say I want to form a 2-d array by stacking a 1-d numpy array. I am currently doing it in a numpy-style way:

np_a = np.ones(10_000)    # dummy example
da_a = da.from_array(np_a, chunks=(10_000, ))
stacked_a = da.repeat(da_a[None:, ], repeats=200, axis=0)
# up to this point
>>> da_a.numblocks
(1, 1)
# I need to do a further rechunk to get the configuration I want
stacked_a = stacked_a.rechunk((200, 10_000))

I actually thought stacked_a would naturally have chunk size (1, 10_000), but that turns out not to be the case. I have also looked at the task graph of stacked_a, which is horribly wide. I guess this might be the bottleneck of my project: my scheduler’s memory always spikes after running for a while. Could this huge task graph be a possible reason for that?
Thanks in advance

guillaumeeb · August 10, 2023, 1:43pm

Hi @noah822, welcome to the Dask Discourse forum!

Yes, repeat does not add a new axis to your Array.
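As a quick check of that point, here is a minimal sketch that only compares shapes. The variable names `repeated_1d` and `repeated_2d` are illustrative, and `da_a[None, :]` is assumed to be the indexing that was intended in the question.

```python
import numpy as np
import dask.array as da

np_a = np.ones(10_000)
da_a = da.from_array(np_a, chunks=(10_000,))

# Repeating the 1-d array along axis 0 keeps it 1-d: it just gets longer.
repeated_1d = da.repeat(da_a, repeats=200, axis=0)
print(repeated_1d.ndim, repeated_1d.shape)

# A new leading axis has to be added explicitly (note `None, :`)
# before repeating, to end up with a 2-d result.
repeated_2d = da.repeat(da_a[None, :], repeats=200, axis=0)
print(repeated_2d.ndim, repeated_2d.shape)
```

Only shapes are shown here. The chunk layout that repeat produces is a separate matter and is best inspected directly with `.chunks` on your own array.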

Having a wide graph is not a problem. It can be the opposite: it could mean your array creation operation is embarrassingly parallel, which is good!

To get the correct shape directly, I would do as follows:

import numpy as np
import dask.array as da

np_a = np.ones(10_000)    # dummy example
da_a = da.from_array(np_a, chunks=(10_000, ))

stacked_a = da.stack([da_a]*200, axis=0)
stacked_a

This results in a Dask array of shape (200, 10000) with chunks of shape (1, 10000), so each row is its own chunk.

Building this array should be embarrassingly parallel, so you shouldn’t have any memory spike coming from this. It then depends on what you are doing afterwards.
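To confirm the layout without relying on the notebook rendering, the chunk structure of the stacked array can be inspected directly. This is a small sketch using the same dummy data as above:

```python
import numpy as np
import dask.array as da

np_a = np.ones(10_000)    # dummy example
da_a = da.from_array(np_a, chunks=(10_000,))

# Stack 200 references to the same 1-d array along a new leading axis.
stacked_a = da.stack([da_a] * 200, axis=0)

print(stacked_a.shape)        # (200, 10000)
print(stacked_a.chunksize)    # (1, 10000): one row per chunk
print(stacked_a.numblocks)    # (200, 1)
print(stacked_a.npartitions)  # 200
```

No rechunk call is needed afterwards, since every row already forms its own chunk.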
