Serializing a chunked 3D dataset to sequential binary on disk without breaking order?

Hi everyone,

Introduction

We work with 3D exploration seismology datasets (SEG-Y), a structured binary format with 3600 bytes of metadata followed by sequential “time-series” traces, each with its own header, regardless of the underlying tensor dimensionality.
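For context, the on-disk layout boils down to the sketch below. This is only an illustration, not our tool's code; the file name, the 4001-sample count, and the IEEE-float assumption (SEG-Y samples are often IBM floats) are placeholders.

```python
import numpy as np

TEXT_HEADER_BYTES = 3200    # textual file header
BINARY_HEADER_BYTES = 400   # binary file header (together: the 3600 bytes of metadata)
TRACE_HEADER_BYTES = 240    # header in front of every trace

def read_trace(f, n_samples, dtype=np.dtype(">f4")):
    """Read one trace: a 240-byte header followed by n_samples amplitudes."""
    header = f.read(TRACE_HEADER_BYTES)
    samples = np.frombuffer(f.read(n_samples * dtype.itemsize), dtype=dtype)
    return header, samples

with open("cube.segy", "rb") as f:
    f.seek(TEXT_HEADER_BYTES + BINARY_HEADER_BYTES)  # skip the 3600-byte metadata
    header, samples = read_trace(f, n_samples=4001)  # traces follow back to back
```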

Current Solution

We wrote a tool that converts this format to Zarr and back; the export to SEG-Y uses Dask for parallel execution and writing. You can find the relevant section here. Step by step (the highlighted area):

  1. Flatten (rechunk) the three arrays along axis = 0
  2. Create Dask keys
  3. Create the task graph as a dict. The function here takes the three arrays and writes a stack of partial SEG-Y files
  4. Convert the task graph to an HLG (HighLevelGraph)
  5. Execute
  6. Combine all the stacks into one file, in order

You will see similarities to dask.array.to_npy_stack, since we used that function as a template and expanded upon it. The differences are that we flatten three arrays instead of one, and the serialization to SEG-Y is different from NPY files.
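Roughly, the pattern looks like the sketch below, following the to_npy_stack recipe. It is only an illustration: write_block, binheader, the file layout, and the exact scheduler call are placeholders (and only two of the three arrays are shown), so it may need adjusting to your Dask version.

```python
import os

from dask.base import tokenize
from dask.highlevelgraph import HighLevelGraph
from dask.threaded import get as threaded_get

def to_segy_stack(samples, headers, binheader, dirname, write_block):
    # 1. Flatten (rechunk) so only axis 0 keeps multiple chunks
    samples = samples.rechunk(
        (samples.chunks[0],) + tuple((s,) for s in samples.shape[1:])
    )
    headers = headers.rechunk(
        (samples.chunks[0],) + tuple((s,) for s in headers.shape[1:])
    )

    # 2. One Dask key per partial SEG-Y file
    name = "to-segy-stack-" + tokenize(samples, headers, dirname)

    # 3. Task graph as a dict: write_block serializes one block of samples and
    #    headers (plus the file-level metadata) into dirname/<i>.segy
    dsk = {
        (name, i): (
            write_block,
            (samples.name, i, 0, 0),   # 3D sample block, single chunk on axes 1-2
            (headers.name, i, 0),      # matching 2D header block
            binheader,
            os.path.join(dirname, f"{i}.segy"),
        )
        for i in range(samples.numblocks[0])
    }

    # 4. Convert the task graph to a HighLevelGraph with the arrays as dependencies
    graph = HighLevelGraph.from_collections(name, dsk, dependencies=[samples, headers])

    # 5. Execute the writes; step 6 (combining the partial files, in order)
    #    happens after this returns
    threaded_get(graph, list(dsk))
```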

Issues

However, we run into memory issues with this approach when the other (non-flattened) dimensions are large.

For example, the dataset could have shape (3500, 13500, 4001) with chunks (128, 512, 128), where the last dimension is the “time-series”. So we need to rechunk this dataset to chunks of (128, 13500, 4001) to be able to sequentially write to a stack of files. With fp32 values, a single chunk would then require about 27 GB of memory PER worker.
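That figure is just the size of one rechunked block:

```python
>>> 128 * 13500 * 4001 * 4 / 1e9   # one axis-0 chunk x full axes 1-2, fp32
27.654912
```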

Proposed Solution?

We are experimenting with applying this function to write out slices within each chunk (locally, instead of flattening the whole array) in the desired order, and then combining them incrementally in the next graph levels as they complete. However, this is not trivial, since we need to create sub-layers within each block that get combined with the next layer of the graph.

See the image below for our proposed solution's pseudo-graph with the following configuration:
Array shape = (6, 4, 10), chunk sizes = (3, 2, 10)
We serialize along the last axis, so you can ignore it.
This gives the following (spatial) blocks:

  • (0, 0) is (slice(0, 3), slice(0, 2))
  • (0, 1) is (slice(0, 3), slice(2, 4))
  • (1, 0) is (slice(3, 6), slice(0, 2))
  • (1, 1) is (slice(3, 6), slice(2, 4))

For each block we serialize along axis=0, which gives 3 slices per block because our chunk size along that axis is 3. Each slice gets serialized independently.
Then we concatenate them on disk once a whole slice across all blocks has been written.
Finally, when they are all written, we concatenate all files.
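Here is a toy end-to-end check of that ordering for the (6, 4, 10) example, in plain NumPy with no graph machinery; the file names are illustrative.

```python
import numpy as np

arr = np.arange(6 * 4 * 10, dtype="float32").reshape(6, 4, 10)
block_rows = [(0, 3), (3, 6)]   # chunking along axis 0 (size 3)
block_cols = [(0, 2), (2, 4)]   # chunking along axis 1 (size 2)

# Step 1: each block writes its axis-0 slices independently
for bi, (r0, r1) in enumerate(block_rows):
    for bj, (c0, c1) in enumerate(block_cols):
        block = arr[r0:r1, c0:c1, :]
        for s in range(block.shape[0]):          # 3 slices per block
            block[s].tofile(f"block{bi}{bj}_slice{s}.bin")

# Step 2: once a whole slice exists across all blocks of a block-row,
# concatenate it on disk in crossline order
with open("combined.bin", "wb") as out:
    for bi, (r0, r1) in enumerate(block_rows):
        for s in range(r1 - r0):                 # global axis-0 index is r0 + s
            for bj, _ in enumerate(block_cols):
                with open(f"block{bi}{bj}_slice{s}.bin", "rb") as part:
                    out.write(part.read())

# The combined file is in the natural (inline, crossline, sample) order
reread = np.fromfile("combined.bin", dtype="float32").reshape(6, 4, 10)
assert np.array_equal(reread, arr)
```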

What do you think would be a good solution to this? Are we on the right track?

Visual Examples

Some visual explanations:
First, these assumptions:

  1. Below is the chunk layout of the 3D data (ignore the sample axis)
  2. Each chunk holds 5x5 traces, i.e. 25 traces.
  3. A “block” is a complete set of crosslines for an inline chunk, and vice versa for crossline blocks.
    [image: chunk layout of the 3D data]

If we look at the four chunks in the bottom left corner, they will look like this.
For now, let’s assume this is the whole seismic cube.
[image: the four bottom-left chunks and their traces]

However, we want to write them out like this, inline sorted:
[image: the desired inline-sorted trace order]

The challenge is that if we write each chunk individually and then concatenate them, the trace order gets scrambled. It would look something like this:
[image: trace order after naive per-chunk concatenation]
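To make the scrambling concrete, here is a small check using (inline, crossline) labels for a 10x10 grid of traces split into four 5x5 chunks (labels only, no samples):

```python
import numpy as np

inlines, crosslines = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
labels = np.stack([inlines, crosslines], axis=-1)   # (inline, crossline) per trace

# Desired order: all crosslines of inline 0, then inline 1, ...
desired = labels.reshape(-1, 2)

# Naive order: dump each 5x5 chunk in turn, traces in C order within the chunk
chunks = [labels[i:i + 5, j:j + 5].reshape(-1, 2)
          for i in (0, 5) for j in (0, 5)]
naive = np.concatenate(chunks)

print(desired[:7])  # (0,0) (0,1) ... (0,6): crosslines stay contiguous
print(naive[:7])    # (0,0) ... (0,4) then (1,0) (1,1): jumps to inline 1 after 5 traces
```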

It sounds like the fundamental bottleneck in this workflow is rechunking. In that sense, your issue is similar to the one discussed here:

That discussion led to the creation of the rechunker package:

With the right combination of target chunks and encoding (no compression), you might be able to simply rechunk your data to zarr and then concatenate all of the zarr chunks into the desired monolithic binary file.
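A minimal sketch of that route with rechunker; the store paths, array name, target chunks, and max_mem budget below are placeholders, assuming fp32 samples.

```python
import zarr
from rechunker import rechunk

source = zarr.open("input.zarr")["samples"]   # (3500, 13500, 4001), chunks (128, 512, 128)

plan = rechunk(
    source,
    target_chunks=(1, 13500, 4001),           # one full inline per chunk
    max_mem="4GB",                            # per-worker memory budget
    target_store="inline_chunks.zarr",
    temp_store="rechunk_tmp.zarr",
    target_options={"compressor": None},      # raw bytes, so chunks concatenate directly
)
plan.execute()

# Each target chunk is now a raw C-order blob; concatenating the chunk files in
# key order (0.0.0, 1.0.0, ...) yields the trace-sorted monolithic sample data.
```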