Hello,
(Originally posted over at GitHub, before I realised that things had moved here :))
I was wondering why Array.copy
behaves differently when there is only 1 partition compared with when there are multiple partitions. In the single partition case, the numpy array appears to be replaced with an in-memory copy of itself (during the compute), but not so in multiple partition case:
>>> import dask.array as da
>>> x = da.from_array(list(range(10)), chunks=-1).copy() # npartitions = 1
>>> x.dask
HighLevelGraph with 2 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f6481de3df0>
0. array-5b75f23fcdbb58b6fb4d5bf579904cdc
1. copy-f858750688dfe3c2a90b611d5f3c9339
>>> x = da.from_array(list(range(10)), chunks=5).copy() # npartitions = 2
>>> x.dask
HighLevelGraph with 1 layers.
<dask.highlevelgraph.HighLevelGraph object at 0x7f648a3011f0>
0. array-7dc27a28b597e5b932909e62d8d97007
The code clearly intends this (dask/core.py at main · dask/dask · GitHub)
def copy(self):
if self.npartitions == 1:
return self.map_blocks(M.copy)
else:
return Array(self.dask, self.name, self.chunks, meta=self)
Is this copy really happening just in the single partition case? and if so I’d be very interested to know why, as it would affect performance.
I have an additional use case that touches on this, aside from performance, in which we are creating an implementation that interfaces dask with “active storage”, where reductions can be carried out on the server where the data is, rather than locally by dask itself, and the results for each chunk fed back into a standard dask workflow.
Our initial approach requires knowledge of whether or not a dask graph only contains a data definition, and no further operations. A copied dask array doesn’t logically have any further operations, but the presence of a copy layer, makes it much harder to determine if I have this situation.
Is this sort of graph analysis possible/sensible (it’s certainly desired!).
Many thanks,
David