Hi all! For my workflow, I'm working with large dask arrays of shape (1000, 1000, 1000, 10, 10, 10) or so. I have them chunked along the first dimension, producing chunks of shape (1, 1000, 1000, 10, 10, 10), which makes sense for my dataset. I am able to lazily schedule numpy-style operations that transform the array (such as dask.array.sum) with no problem. However, I would also like to add functions to the graph that visit each block and do side jobs on it, such as saving it or running other code on it, without transforming the data.
After a lot of effort, the only way I can get this to execute is to persist my array first and then compute the side functions afterwards:
```python
data = np.sum(data)  # Other operations also work
data = data.persist()
data = data.map_blocks(save_blockwise, dtype=data.dtype).compute()
```
Here save_blockwise just takes a block, saves it appropriately, and returns the block unchanged. However, with this approach I'm worried that I have to bring my entire large array into memory with persist before saving anything. Ideally, I would schedule the save alongside all the other operations and let it happen asynchronously. Something like:
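For context, my save_blockwise looks roughly like this (a simplified sketch; the real version saves to my own storage format, and the file path and naming scheme here are placeholders). It optionally uses the block_info keyword that map_blocks passes to functions that accept it:

```python
import os
import tempfile
import numpy as np

def save_blockwise(block, block_info=None):
    """Save a block to disk and return it unchanged (simplified sketch)."""
    # When called via dask.array.map_blocks, block_info describes this
    # block's position in the overall array; it is None when called directly.
    if block_info is not None:
        idx = block_info[0]["chunk-location"]
    else:
        idx = (0,)
    # Placeholder destination; the real code writes to my own format/paths.
    path = os.path.join(
        tempfile.gettempdir(), "block_" + "_".join(map(str, idx)) + ".npy"
    )
    np.save(path, block)
    return block
```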
```python
data = np.sum(data)  # Other operations also work
data = data.map_blocks(save_blockwise, dtype=data.dtype)
data = data.persist()
```
Note that if I do this, the computation completes and no errors are raised, but no files are written. Inspecting the graph before and after the map_blocks call shows that nothing has been added to it.
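For reference, this is roughly how I inspect the graph, shown here with a tiny stand-in array and a no-op save_blockwise so it is self-contained:

```python
import dask.array as da

def save_blockwise(block):
    # Stand-in: the real version writes the block to disk first.
    return block

x = da.ones((4, 4), chunks=(2, 2))
before = set(x.__dask_graph__())  # keys in the graph before map_blocks

y = x.map_blocks(save_blockwise, dtype=x.dtype)
after = set(y.__dask_graph__())   # keys after map_blocks

# Any keys present only in `after` were added by the map_blocks call.
print(len(after - before))
```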
Dask is a really cool tool, hope someone could help a new user out a little!