Hi all! For my workflow, I'm working with large dask arrays of shape (1000, 1000, 1000, 10, 10, 10) or so. I have them chunked along the first dimension, producing chunks of shape (1, 1000, 1000, 10, 10, 10), which makes sense for my dataset. I am able to lazily schedule numpy-style operations that transform the array (such as dask.array.sum) with no problem. However, I would also like to add functions to the graph that visit each block and do side jobs on it, such as saving it or running other code on it, without transforming the data.
After a lot of effort, the only way I can get this to execute is to persist my array first and then compute the side functions afterwards:
```python
data = np.sum(data)  # Other operations also work
data = data.persist()
data = data.map_blocks(save_blockwise, dtype=data.dtype).compute()
```
Here save_blockwise just takes a block, saves it appropriately, and returns the block unchanged. However, with this approach I'm worried that I have to bring my entire large array into memory with persist before saving anything. Ideally, I would schedule the save alongside all the other operations and let it happen asynchronously. Something like:
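For context, my save_blockwise looks roughly like this (a simplified sketch; the real version saves to my own storage format, and the file path and naming scheme here are placeholders). It optionally uses the block_info keyword that map_blocks passes to functions that accept it:

```python
import os
import tempfile
import numpy as np

def save_blockwise(block, block_info=None):
    """Save a block to disk and return it unchanged (simplified sketch)."""
    # When called via dask.array.map_blocks, block_info describes this
    # block's position in the overall array; it is None when called directly.
    if block_info is not None:
        idx = block_info[0]["chunk-location"]
    else:
        idx = (0,)
    # Placeholder destination; the real code writes to my own format/paths.
    path = os.path.join(
        tempfile.gettempdir(), "block_" + "_".join(map(str, idx)) + ".npy"
    )
    np.save(path, block)
    return block
```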
```python
data = np.sum(data)  # Other operations also work
data = data.map_blocks(save_blockwise, dtype=data.dtype)
data = data.persist()
```
Note that if I do this, the computation completes and no errors are raised, but no files are written. Inspecting the graph before and after the map_blocks call shows that nothing has been added to it.
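For reference, this is roughly how I inspect the graph, shown here with a tiny stand-in array and a no-op save_blockwise so it is self-contained:

```python
import dask.array as da

def save_blockwise(block):
    # Stand-in: the real version writes the block to disk first.
    return block

x = da.ones((4, 4), chunks=(2, 2))
before = set(x.__dask_graph__())  # keys in the graph before map_blocks

y = x.map_blocks(save_blockwise, dtype=x.dtype)
after = set(y.__dask_graph__())   # keys after map_blocks

# Any keys present only in `after` were added by the map_blocks call.
print(len(after - before))
```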
Dask is a really cool tool, hope someone could help a new user out a little!