Hello! I was wondering whether it’s better for parallelism to chain multiple function calls (which may or may not reshape the array) on a single mutable dask.array object:
```python
data = dask.array.from_zarr(...)
data = function1(data)
data = function2(data)
data = function3(data)
```
or whether it’s better to have separate data objects for each operation:
```python
data0 = dask.array.from_zarr(...)
data1 = function1(data0)
data2 = function2(data1)
data3 = function3(data2)
```
I’ve been playing with a single mutable array (the first option), which works well, but I noticed that successive functions sometimes wait for the previous function to complete fully (when they should be able to run in parallel), especially custom functions implemented via map_blocks.
I’m wondering if the second option might be a way of resolving that.
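For concreteness, here is a minimal sketch of the kind of pipeline I mean; `denoise_block`, the array shape, and the chunking are placeholders rather than my real code:

```python
import numpy as np
import dask.array as da

def denoise_block(block):
    # placeholder for a real per-chunk computation
    return block - block.mean()

data = da.random.random((1000, 1000), chunks=(100, 100))
data = data.map_blocks(denoise_block)  # "function1"
data = data.map_blocks(np.square)      # "function2"
result = data.compute()
```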
Hi @azagajewski,
Dask Arrays (and all Dask collections) are immutable objects. This means that when you apply an operation to one, you create new values; you don’t modify it in place.
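A quick illustration of that immutability (a minimal sketch, nothing specific to your workflow):

```python
import dask.array as da

a = da.ones((4, 4), chunks=(2, 2))
b = a + 1                  # builds a new lazy array; `a` is untouched
print(a is b)              # False: two distinct objects with distinct graphs
print(a.sum().compute())   # 16.0, unchanged by the operation above
```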
Anyway, regarding your examples, I’m not sure there is a real difference between the two. In both cases, you just chain each `function*` call to the output of the previous call. In one case you keep references to the intermediate results (the second snippet); in the other you don’t (which is better for memory efficiency if you don’t need them).
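A small sketch showing the equivalence, using trivial arithmetic in place of your functions: since Dask tokenizes operations deterministically, both styles produce the same task graph, and only the Python name bindings differ.

```python
import dask.array as da

x = da.arange(10, chunks=5)

# Style 1: rebind the same name
data = x
data = data + 1
data = data * 2

# Style 2: keep every intermediate
data0 = x
data1 = data0 + 1
data2 = data1 * 2

# Dask builds the same graph either way, so the final keys match
print(data.name == data2.name)  # True
```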
There is not enough information about your workflow to tell whether the functions can really run in parallel, but as written, you are asking Dask to wait (at least block by block) for the previous call before continuing to the next function call. Could you produce a complete Minimal Reproducible Example so that we can better help you?
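In the meantime, a skeleton along the following lines would be a good starting point for such an example; the function bodies, shapes, and chunk sizes are placeholders, and it assumes `dask.distributed` is installed so you can watch the dashboard to see whether tasks overlap:

```python
import dask.array as da
from dask.distributed import Client

def function1(block):
    return block + 1  # replace with the real per-block work

def function2(block):
    return block * 2  # replace with the real per-block work

if __name__ == "__main__":
    client = Client()  # local cluster; open the dashboard to watch task overlap
    data = da.random.random((2000, 2000), chunks=(500, 500))
    data = data.map_blocks(function1)
    data = data.map_blocks(function2)
    data.compute()
    # Each block can proceed to function2 as soon as function1 has
    # finished on that block; there is no global barrier between calls.
```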