Welcome @cuauty!
You’re right about the behavior of `dask.array.from_delayed()`. Indeed, the docs explicitly say: “The dask array will consist of a single chunk.”
The docs also hint at the answer to your question: “This routine is useful for constructing dask arrays in an ad-hoc fashion using dask delayed, particularly when combined with `stack` and `concatenate`.”
So you could use `dask.array.from_delayed()` for each chunk and afterwards concatenate the results to obtain a single dask array, as in the following:
```python
import numpy as np
import dask
import dask.array as da

value_1 = dask.delayed(np.ones)(5)
chunk_1 = da.from_delayed(value_1, (5,), dtype=float)

value_2 = dask.delayed(np.zeros)(3)
chunk_2 = da.from_delayed(value_2, (3,), dtype=float)

# dask will automatically assign workers to each chunk
array = da.concatenate([chunk_1, chunk_2], axis=0)

array.chunks     # ((5, 3),)
array.compute()  # array([1., 1., 1., 1., 1., 0., 0., 0.])
```
Naturally, you’d need to replace `np.ones` and `np.zeros` with your function that loads your data from storage into memory.
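As a minimal sketch of what that might look like (the `load_partition` helper, the file names, and the partition lengths here are hypothetical placeholders, since I don’t know your storage layout; note that `from_delayed()` requires the shape of each chunk up front):

```python
import numpy as np
import dask
import dask.array as da

# Hypothetical loader: reads one partition from storage into a NumPy array.
# The body below just simulates it; swap in your real I/O, e.g. np.load(path).
def load_partition(path, length):
    return np.ones(length)

# (path, length) pairs describing each partition -- placeholders.
specs = [("part-0.npy", 5), ("part-1.npy", 3)]

chunks = [
    da.from_delayed(dask.delayed(load_partition)(path, n), (n,), dtype=float)
    for path, n in specs
]
array = da.concatenate(chunks, axis=0)

array.chunks  # ((5, 3),)
```

Nothing is read until you call `array.compute()` (or hand the array to further dask operations), so the loads stay lazy and parallelizable.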
There are of course other approaches that avoid `from_delayed()` altogether, such as `map_blocks()`, for instance. But I think the above should be sufficient for your needs. Feel free to counter if it is not.