Hi,
I have a fairly large numpy array that is generated using a library that does not work well with dask. I want to operate on that array using dask, so I use da.from_array to convert the numpy array to a dask array. However, the conversion causes an out-of-memory error which I traced to
dask/dask/array/_array_expr/_collection.py at 9c41044af1c5421ba019faf6329f43df43c9fc62 · dask/dask · GitHub. Why does from_array make a copy? Should this be documented? Is there a danger in removing the copy statement? Thanks.
There was some discussion on this when it was implemented over at a PR (no issue): Ensure that from_array creates a copy by phofl · Pull Request #11524 · dask/dask · GitHub. I was (and still am :-)) in favour of not copying, or else of having a copy keyword defaulting to True that would let us opt out of the copy if we chose.
Hi @joshua-gould,
And thanks to @joshua-gould for pointing to the right discussion about this!
Bottom line is: you don’t want to use from_array when working with large Numpy arrays.
I understand that not copying would be the logical behaviour, but it’s also hard to make Dask work reliably in distributed mode without some copy.
I don’t see any good solution here: large arrays should always be read on the worker side, or inside tasks in a non-distributed setup. In this case that would mean dumping the array to storage in a distributed-friendly format and reading it back with Dask afterwards, and I’m not really fond of forcing a storage layer in there either.
Maybe you should both open a new GitHub issue about that.
Won’t this code silently “fail” too, in the same way that creating a dask array from a numpy array without copying would?
import dask.array as da
import zarr
z = zarr.zeros(shape=(2, 3))
d = da.from_zarr(z)  # wraps the zarr store without copying
z[0] = 2             # mutate the store after wrapping
d.compute()          # the chunk is read lazily, so the change shows up here
Instead of copying within da.from_array, what about creating a read only view?
result = arr.view()
result.flags.writeable = False
Or have da.from_array skip the copy when arr.flags.writeable is False and arr.flags.owndata is False?
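A quick NumPy sketch of both suggestions. Note that the read-only flag only blocks writes through the view itself; the caller still holds the original writable array, so the data underneath can still silently change:

```python
import numpy as np

arr = np.zeros((2, 3))

# The read-only-view idea: writes *through the view* are rejected...
view = arr.view()
view.flags.writeable = False
try:
    view[0] = 1
except ValueError:
    pass  # "assignment destination is read-only"

# ...but the original owner can still mutate the shared buffer,
# and the change is visible through the view
arr[0] = 2
assert view[0, 0] == 2.0

# The flags the second suggestion would check: a locked view neither
# is writeable nor owns its buffer
print(view.flags.writeable, view.flags.owndata)  # False False
```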
If you are talking about modifying an array in place, yes it would. Dask collections should be considered immutable.
But if we’re talking about memory efficiency, then it’s much better (without the d.compute()).
It’s beyond my knowledge here, but keep in mind that Dask is made to work in multiprocessing and distributed mode, without any shared memory between processes.