dask.Array copy behaviour

davidhassell · September 22, 2022, 9:39am

Hello,

(Originally posted over at GitHub, before I realised that things had moved here :))

I was wondering why Array.copy behaves differently when there is only 1 partition compared with when there are multiple partitions. In the single partition case, the numpy array appears to be replaced with an in-memory copy of itself (during the compute), but not so in multiple partition case:

>>> import dask.array as da
>>> x = da.from_array(list(range(10)), chunks=-1).copy()   # npartitions = 1
>>> x.dask
HighLevelGraph with 2 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f6481de3df0>
0. array-5b75f23fcdbb58b6fb4d5bf579904cdc
1. copy-f858750688dfe3c2a90b611d5f3c9339
>>> x = da.from_array(list(range(10)), chunks=5).copy()   # npartitions = 2
>>> x.dask
HighLevelGraph with 1 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f648a3011f0>
0. array-7dc27a28b597e5b932909e62d8d97007

The code clearly intends this (dask/core.py at main · dask/dask · GitHub)

    def copy(self):
        if self.npartitions == 1:
            return self.map_blocks(M.copy)
        else:
            return Array(self.dask, self.name, self.chunks, meta=self)

Is this copy really happening just in the single partition case? and if so I’d be very interested to know why, as it would affect performance.

I have an additional use case that touches on this, aside from performance, in which we are creating an implementation that interfaces dask with “active storage”, where reductions can be carried out on the server where the data is, rather than locally by dask itself, and the results for each chunk fed back into a standard dask workflow.

Our initial approach requires knowledge of whether or not a dask graph only contains a data definition, and no further operations. A copied dask array doesn’t logically have any further operations, but the presence of a copy layer, makes it much harder to determine if I have this situation.

Is this sort of graph analysis possible/sensible (it’s certainly desired!).

Many thanks,
David

Topic		Replies	Views
Why does dask.array.from_array make a copy? Dask Array	4	43	April 11, 2025
Maintaining index between .values and .to_dask_dataframe Dask DataFrame	3	130	February 23, 2024
Most efficient way to copy from Dask array to Numpy Dask Array dask-array	2	54	December 4, 2024
Overlapping Computations and Cloning Dask Array dask-array , distributed	11	691	May 2, 2023
writable shared array Dask Array	3	221	April 8, 2022

dask.Array copy behaviour

Related topics