Hello,
From my understanding of dask arrays, there are several ways to chain mathematical operations.
1. Use dask ufuncs
import dask.array as da
array = da.random.random((10, 10), chunks=(5, 5))
output = da.abs(da.exp(da.sin(array)))
output.visualize("ufuncs.png")
The resulting graph is:
2. Use map_blocks
import numpy as np
import dask.array as da

array = da.random.random((10, 10), chunks=(5, 5))

def f(array):
    return np.abs(np.exp(np.sin(array)))

output = da.map_blocks(f, array)
output.visualize("map_blocks.png")
The resulting graph is smaller this time:
From my understanding of the documentation, map_blocks should only be used when no Dask ufunc is available, which suggests preferring ufuncs otherwise.
However, the previous figures suggest that the map_blocks graph is smaller than the chained-ufunc one. Timing this specific dummy example also favors the map_blocks version:

Time with ufuncs:
%timeit output.compute()
3.38 ms ± 101 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Time with map_blocks:
%timeit output.compute()
2.1 ms ± 77.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
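To quantify the difference without relying on the pictures, one way (a sketch, assuming the same snippets as above and using the collection protocol's __dask_graph__, which returns the raw, pre-optimization graph) is to count tasks directly:

```python
import numpy as np
import dask.array as da

array = da.random.random((10, 10), chunks=(5, 5))

# Chained ufuncs: one graph layer per operation
# (da.absolute used here in place of da.abs)
ufunc_output = da.absolute(da.exp(da.sin(array)))

def f(block):
    return np.abs(np.exp(np.sin(block)))

# map_blocks: a single layer on top of the random layer
mapped_output = da.map_blocks(f, array)

# Count tasks in the unoptimized graphs
n_ufunc = len(dict(ufunc_output.__dask_graph__()))
n_mapped = len(dict(mapped_output.__dask_graph__()))
print(n_ufunc, n_mapped)
```

With four chunks, each chained operation contributes one task per chunk, so the ufunc graph should carry noticeably more tasks before optimization.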
This is all the more important as the graph grows:
array = da.random.random((10, 10), chunks=(5, 5))
for _ in range(200):
    array = da.abs(da.exp(da.sin(array)))
%timeit array.compute()
437 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
array = da.random.random((10, 10), chunks=(5, 5))
for _ in range(200):
    array = da.map_blocks(f, array)
%timeit array.compute()
59.7 ms ± 4.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I’m a bit confused about the best approach for my real-world use case (large computation graph, >100 mathematical operations).
Should I prioritize map_blocks to minimize graph size and reduce scheduler overhead, or rely on Dask ufuncs for functional clarity and potential optimizations? Are there situations where the recommendation for ufuncs might not apply?
My test cases are quite simple and only touch on a fraction of the complexity found in Dask graphs. However, they suggest that relieving Dask of the need to track the dependencies between all operations performed on a chunk can help. Have others faced similar situations?
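For reference, for a deep chain like the one above, I could also collapse the whole pipeline into a single NumPy function applied once per chunk, so that Dask tracks one task per chunk regardless of the number of operations (same dummy pipeline as before):

```python
import numpy as np
import dask.array as da

def pipeline(block):
    # The entire 200-step chain runs inside one NumPy call per chunk,
    # so the graph holds one task per chunk instead of 200 per chunk.
    for _ in range(200):
        block = np.abs(np.exp(np.sin(block)))
    return block

array = da.random.random((10, 10), chunks=(5, 5))
result = da.map_blocks(pipeline, array).compute()
```

The trade-off is that the intermediate steps are opaque to Dask, so they cannot be optimized, inspected, or reused individually.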
Thanks a lot for your help,