Hi,
I have a fairly large numpy array that is generated using a library that does not work well with dask. I want to operate on that array using dask, so I use da.from_array to convert the numpy array to a dask array. However, the conversion causes an out-of-memory error which I traced to
dask/dask/array/_array_expr/_collection.py at 9c41044af1c5421ba019faf6329f43df43c9fc62 · dask/dask · GitHub. Why does from_array make a copy? Should this be documented? Is there a danger in removing the copy statement? Thanks.
There was some discussion on this when it was implemented over at a PR (no issue): Ensure that from_array creates a copy by phofl · Pull Request #11524 · dask/dask · GitHub. I was (and still am :-)) in favour of not copying, or else of having a copy keyword defaulting to True that would let us opt out of the copy if we chose.
Hi @joshua-gould,
And thanks to @joshua-gould for pointing to the right discussion about this!
Bottom line is: you don’t want to use from_array when working with large Numpy arrays.
I understand that not copying would be the logical behaviour, but it’s also hard to make Dask work reliably in distributed mode without some copy.
I don’t see any good solution here: large arrays should always be read on the worker side, or inside tasks in a non-distributed setup. In this case that would mean dumping the array to storage in a distributed-friendly format and reading it back with Dask afterwards, and I’m not really fond of forcing a storage layer in there either.
Maybe you should both open a new GitHub issue about that.
Won’t this code silently “fail” too, in the same way that creating a dask array from a numpy array without copying would?
import dask.array as da
import zarr
z = zarr.zeros(shape=(2, 3))
d = da.from_zarr(z)  # wraps the zarr store without copying
z[0] = 2             # mutate the store after wrapping
d.compute()          # the chunk is read lazily, so the change shows up here
Instead of copying within da.from_array, what about creating a read only view?
result = arr.view()
result.flags.writeable = False
Or have da.from_array skip the copy when arr.flags.writeable is False and arr.flags.owndata is False?
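A quick NumPy sketch of both suggestions. Note that the read-only flag only blocks writes through the view itself; the caller still holds the original writable array, so the data underneath can still silently change:

```python
import numpy as np

arr = np.zeros((2, 3))

# The read-only-view idea: writes *through the view* are rejected...
view = arr.view()
view.flags.writeable = False
try:
    view[0] = 1
except ValueError:
    pass  # "assignment destination is read-only"

# ...but the original owner can still mutate the shared buffer,
# and the change is visible through the view
arr[0] = 2
assert view[0, 0] == 2.0

# The flags the second suggestion would check: a locked view neither
# is writeable nor owns its buffer
print(view.flags.writeable, view.flags.owndata)  # False False
```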
If you are talking about modifying an array in place, yes it would. Dask collections should be considered immutable.
But if we’re talking about memory efficiency, then it’s much better (without the d.compute()).
It’s beyond my knowledge here, but keep in mind that Dask is made to work in multiprocessing and distributed mode, without any shared memory between processes.