Hello! I was wondering whether it’s better for parallelism to chain multiple function calls (which may or may not reshape the array) on a single mutable dask.array object:
```python
data = dask.array.from_zarr(...)
data = function1(data)
data = function2(data)
data = function3(data)
```
or whether it’s better to have separate data objects for each operation:
```python
data0 = dask.array.from_zarr(...)
data1 = function1(data0)
data2 = function2(data1)
data3 = function3(data2)
```
I’ve been playing with a single mutable array (the first option), which works well, but I noticed that successive functions sometimes wait for the previous function to complete fully (when they should be able to run in parallel), especially custom functions implemented via map_blocks.
I’m wondering if the second option might be a way of resolving that.
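For concreteness, here is a minimal sketch of the kind of pipeline I mean; `denoise_block`, the array shape, and the chunking are placeholders rather than my real code:

```python
import numpy as np
import dask.array as da

def denoise_block(block):
    # placeholder for a real per-chunk computation
    return block - block.mean()

data = da.random.random((1000, 1000), chunks=(100, 100))
data = data.map_blocks(denoise_block)  # "function1"
data = data.map_blocks(np.square)      # "function2"
result = data.compute()
```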
Hi @azagajewski,
Dask Arrays (and all Dask collections) are immutable objects. This means that when you apply an operation to one, you create new values; you don’t modify it in place.
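A quick illustration of that immutability (a minimal sketch, nothing specific to your workflow):

```python
import dask.array as da

a = da.ones((4, 4), chunks=(2, 2))
b = a + 1                  # builds a new lazy array; `a` is untouched
print(a is b)              # False: two distinct objects with distinct graphs
print(a.sum().compute())   # 16.0, unchanged by the operation above
```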
Anyway, regarding your examples, I’m not sure there is a real difference between the two. In both cases, you just chain each `function*` call to the output of the previous call. In one case you keep references to the intermediate results (the second snippet); in the other you don’t (which is better for memory efficiency if you don’t need them).
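A small sketch showing the equivalence, using trivial arithmetic in place of your functions: since Dask tokenizes operations deterministically, both styles produce the same task graph, and only the Python name bindings differ.

```python
import dask.array as da

x = da.arange(10, chunks=5)

# Style 1: rebind the same name
data = x
data = data + 1
data = data * 2

# Style 2: keep every intermediate
data0 = x
data1 = data0 + 1
data2 = data1 * 2

# Dask builds the same graph either way, so the final keys match
print(data.name == data2.name)  # True
```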
There is not enough information about your workflow to tell whether the functions can really run in parallel, but as written, you are asking Dask to wait (at least block by block) for the previous call before continuing to the next function call. Could you produce a complete Minimal Reproducible Example so that we can better help you?
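In the meantime, a skeleton along the following lines would be a good starting point for such an example; the function bodies, shapes, and chunk sizes are placeholders, and it assumes `dask.distributed` is installed so you can watch the dashboard to see whether tasks overlap:

```python
import dask.array as da
from dask.distributed import Client

def function1(block):
    return block + 1  # replace with the real per-block work

def function2(block):
    return block * 2  # replace with the real per-block work

if __name__ == "__main__":
    client = Client()  # local cluster; open the dashboard to watch task overlap
    data = da.random.random((2000, 2000), chunks=(500, 500))
    data = data.map_blocks(function1)
    data = data.map_blocks(function2)
    data.compute()
    # Each block can proceed to function2 as soon as function1 has
    # finished on that block; there is no global barrier between calls.
```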