Efficient dask array repeat without further rechunking

I wonder whether there is an efficient way to build a new dask array by repeating a base array multiple times, such that each row of the new array is a chunk of its own. Say I want to form a 2-d array by stacking a 1-d numpy array. I am currently doing it in a numpy-style way:

np_a = np.ones(10_000)    # dummy example
da_a = da.from_array(np_a, chunks=(10_000, ))
stacked_a = da.repeat(da_a[None, :], repeats=200, axis=0)
# up to this point
>>> stacked_a.numblocks
(1, 1)
# I need to do a further rechunk to get the configuration I want
stacked_a = stacked_a.rechunk((1, 10_000))

I actually thought stacked_a would naturally have chunk size (1, 10_000), but that turns out not to be the case. I have also looked at the task graph of stacked_a, which is a horribly wide graph. I suspect this might be the bottleneck of my project: my scheduler’s memory always spikes after running for a while. Could this huge task graph be a possible reason for that?
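For reference, this is roughly the kind of check I am doing to look at the graph (just a sketch, the exact task count depends on the chunking):

# quick look at how many tasks the graph contains
print(len(stacked_a.__dask_graph__()))
# render the graph to a file to see how wide it is (requires graphviz)
stacked_a.visualize("stacked_a.svg")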
Thanks in advance :slight_smile:

Hi @noah822, welcome to the Dask Discourse forum!

Yes, repeat does not add a new axis to your Array.

Having a wide graph is not a problem in itself, it can even be the opposite: it could mean your array creation operation is embarrassingly parallel, which is good!

In order to directly get the chunk layout you want, I would do the following:

import numpy as np
import dask.array as da

np_a = np.ones(10_000)    # dummy example
da_a = da.from_array(np_a, chunks=(10_000, ))

stacked_a = da.stack([da_a]*200, axis=0)  # each copy becomes its own (1, 10_000) chunk
stacked_a

Which results in this Dask array:
(screenshot of the Dask array repr: shape (200, 10000), chunk size (1, 10000), 200 chunks)
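You can also check the chunk layout directly; with the code above I would expect something like:

>>> stacked_a.numblocks
(200, 1)
>>> stacked_a.chunks[0][:5]
(1, 1, 1, 1, 1)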

Building this array should be embarrassingly parallel, so you shouldn’t have any memory spike coming from this step. Whether you see one later depends on what you are doing afterwards.
