Question:
Does anyone foresee any problems (in particular, scheduling issues) with delaying the creation of a dask array until execution time?
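import dask.array as da
import numpy as np

def create_array(sz):
    # Runs at execution time, once sz is a concrete value; returns a (still lazy) dask array.
    return da.arange(sz)

indices = da.from_array([2, 0, 5])
size = da.max(indices) + 1  # lazy 0-d dask array, value unknown at graph-construction time
kwargs = {'dtype': np.int32, 'meta': np.array(0)}
_2x_dlayd_array = size.map_blocks(create_array, **kwargs)
# The first compute() yields the dask array built by create_array; the second evaluates it.
_2x_dlayd_array.compute().compute()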
Premise:
There are often times (especially in sparse matrix settings) when one needs to create a dask array whose shape/size is unknown at graph-construction time and is only determined at execution time.
As of now, dask does not directly allow the creation of arrays of delayed shape/size.
One could, of course, simply call compute() to get the shape/size during graph construction, but this is likely to be computationally expensive and inconvenient; one would prefer to defer that computation to the very end, when it is more convenient.
With the example code snippet above, this becomes possible.
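For comparison, the eager approach mentioned above (calling compute() during graph construction) might look roughly like the sketch below. It reuses the same indices as the example; the names eager_size, eager_array and result are only illustrative, and wrapping the list in np.array with an explicit chunks value is my own choice, not something the approach requires.

```python
import dask.array as da
import numpy as np

indices = da.from_array(np.array([2, 0, 5]), chunks=3)

# Eager alternative: force the size to a concrete integer while the graph
# is still being built. This triggers a computation immediately.
eager_size = int((da.max(indices) + 1).compute())

# With a concrete size, the result is an ordinary (once-delayed) dask array.
eager_array = da.arange(eager_size)
result = eager_array.compute()
```

Here a single compute() at the end is enough, but only because the size was already computed up front, which is exactly the cost the twice-delayed version tries to postpone.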
Naturally, this complicates how one handles the “twice-delayed” array _2x_dlayd_array, since every subsequent result derived from this oddity is also twice-delayed. For example:
from functools import partial
do_something_to = partial(da.outer, indices)
_2x_dlayd_array = da.map_blocks(do_something_to, _2x_dlayd_array, **kwargs)
_2x_dlayd_array.compute().compute()
Any ideas, cautionary notes, etc., would be greatly appreciated!