Saving large dask arrays one block at a time, without first persisting in memory

Hi all! For my workflow, I’m working with large dask arrays of size (1000,1000,1000,10,10,10) or so. I have them chunked along the first dimension, producing chunks of size (1,1000,1000,10,10,10), which makes sense for my dataset. I am able to lazily load and schedule numpy operations that transform it (such as dask.array.sum), no problem. However, I would like to add to the graph functions that visit each block and do side jobs on it, such as save it, or run other code on it without transforming the data.

After a lot of effort, the only way I can get this to execute is to persist my array, and then compute the side functions after:

data = np.sum(data) #Other operations also work
data = data.persist()
data = data.map_blocks( save_blockwise, dtype=data.dtype).compute()

Where save_blockwise just takes a block, saves it appropriately, and returns the block without changes. However, with this approach I’m worried I have to bring my entire large array into memory first with persist before saving anything. Ideally, I would schedule the save with all the other operations, and allow that to happen asynchronously. Something like:

data = np.sum(data) #Other operations also work
data = data.map_blocks( save_blockwise, dtype=data.dtype)
data = data.persist()

Note, if I do that, the operation completes and no errors are raised, but no files are written. Inspecting the graph before and after map_blocks shows that nothing further has been added to the graph.

Dask is a really cool tool, hope someone could help a new user out a little! :smile:

Hi @azagajewski,

Could you give us a complete minimum working example of your use case?

At first sight, I don’t see why you’ll need to persist your array first, and yes this would consume a lot of memory on your Dask cluster.

One other thing:

If you return the block and call compute(), you’ll return the whole array as a Numpy one into your client process, which will probably blow up your Client memory. If you need a custom save operation and can’t use the built in function like to_zarr, you should make an operation that doesn’t return anything in order to be able to call compute on it.

Sorry for the late response, it was a small bug in my code, all good now :slight_smile: Thank you for your response though!

1 Like