I wonder if there is an efficient way to build a new dask array by repeating a base array multiple times, such that each row of the new array is a chunk by itself. Say I want to form a 2-D array by stacking a 1-D numpy array. I am currently doing it in a numpy-style way:
import numpy as np
import dask.array as da

np_a = np.ones(10_000)  # dummy example
da_a = da.from_array(np_a, chunks=(10_000,))
stacked_a = da.repeat(da_a[None, :], repeats=200, axis=0)
# up to this point, everything still lives in a single block
>>> stacked_a.numblocks
(1, 1)
# I need to do a further rechunk to get the configuration I want
stacked_a = stacked_a.rechunk((1, 10_000))
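As an aside, I have been wondering whether da.broadcast_to would sidestep the rechunk entirely, since every row here is identical and it accepts a chunks argument. This is just a sketch of what I have in mind, assuming I am reading the docs correctly:

# broadcast the 1-D base array to the target 2-D shape, setting the
# row-per-chunk layout up front instead of rechunking afterwards
>>> broadcast_a = da.broadcast_to(da_a[None, :], (200, 10_000), chunks=(1, 10_000))
>>> broadcast_a.numblocks
(200, 1)

Would that be a reasonable approach here, or does broadcasting like this cause problems downstream?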
I actually thought stacked_a would naturally have chunks of size (1, 10_000), but it turns out not to be the case. I have also looked at the task graph of stacked_a, which is a horribly wide graph. I suspect this might be the bottleneck of my project: my scheduler's memory always spikes after running for a while. Could this huge task graph be a possible reason for that?
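For reference, this is how I have been inspecting the graph (a minimal sketch; the output filename is my own choice):

# count the tasks in the low-level graph to compare approaches
print(len(stacked_a.__dask_graph__()))
# render the graph to see how wide it is (requires graphviz)
stacked_a.visualize(filename="stacked_a.svg")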
Thanks in advance