Hi everyone,
Introduction
We work with a 3D exploration seismology dataset in SEG-Y format: a structured binary file with 3600 bytes of metadata followed by sequential “time-series” traces, each with its own header, regardless of the underlying tensor dimensionality.
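For context, the 3600-byte preamble splits into a 3200-byte textual header and a 400-byte binary header (per the SEG-Y rev 1 standard). A minimal sketch of parsing two well-known binary-header fields; the function name is hypothetical, not part of our tool:

```python
import struct

def read_segy_preamble(buf: bytes):
    """Parse the fixed 3600-byte SEG-Y preamble: 3200-byte textual header
    plus 400-byte binary header. Offsets follow the SEG-Y rev 1 standard."""
    textual, binary = buf[:3200], buf[3200:3600]
    interval_us, = struct.unpack(">H", binary[16:18])  # file bytes 3217-3218
    nsamples, = struct.unpack(">H", binary[20:22])     # file bytes 3221-3222
    return textual, interval_us, nsamples

# Synthetic example: 2000 µs sample interval, 4001 samples per trace
buf = b" " * 3200 + b"\x00" * 16 + struct.pack(">H2xH", 2000, 4001) + b"\x00" * 378
_, dt_us, ns = read_segy_preamble(buf)
# dt_us == 2000, ns == 4001
```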
Current Solution
We wrote a tool to convert this format back and forth to Zarr; the export back to SEG-Y uses Dask for parallel execution and writing. You can find the relevant section here. Step by step (the highlighted area):
- Flatten (rechunk) the three arrays along axis = 0
- Create dask keys
- Create the task graph as a dict. The function here takes the three arrays and writes a stack of partial SEG-Y files
- Convert the task graph to an HLG (`HighLevelGraph`)
- Execute
- Combine all the stacks into one file in order
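The steps above can be sketched with a dict-style task graph. This is an illustrative stand-in, not our actual code or Dask API: `write_part`, `concat`, and the naive sequential `get` executor are placeholders (Dask's scheduler would execute the same graph in parallel), and the byte strings stand in for serialized SEG-Y blocks:

```python
import os
import tempfile

def write_part(path, payload):
    """Write one partial SEG-Y stack member to disk."""
    with open(path, "wb") as f:
        f.write(payload)
    return path

def concat(paths, out_path):
    """Combine the partial files into one file, in order."""
    with open(out_path, "wb") as out:
        for p in paths:
            with open(p, "rb") as src:
                out.write(src.read())
    return out_path

tmp = tempfile.mkdtemp()
parts = [b"chunk0", b"chunk1", b"chunk2"]  # stand-ins for serialized axis-0 chunks

# Dask-style dict graph: one write task per axis-0 chunk, then a combine task.
graph = {("write", i): (write_part, os.path.join(tmp, f"part{i}.segy"), payload)
         for i, payload in enumerate(parts)}
graph["combine"] = (concat, [("write", i) for i in range(len(parts))],
                    os.path.join(tmp, "full.segy"))

def get(graph, key, cache=None):
    """Naive sequential executor for the dict graph (Dask parallelizes this)."""
    cache = {} if cache is None else cache
    if key not in cache:
        func, *args = graph[key]
        cache[key] = func(*[_resolve(graph, a, cache) for a in args])
    return cache[key]

def _resolve(graph, arg, cache):
    # Arguments may be keys, lists of keys, or literal values.
    if isinstance(arg, list):
        return [_resolve(graph, a, cache) for a in arg]
    if isinstance(arg, (tuple, str)) and arg in graph:
        return get(graph, arg, cache)
    return arg

final = get(graph, "combine")
```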
You will see similarities to dask.array.to_npy_stack, as we used that function as a template and expanded upon it. The differences are that we flatten three arrays instead of one, and that serialization to SEG-Y differs from NPY files.
Issues
However, we run into memory issues doing it this way when the data is large along the other dimensions.
For example, the dataset could have shape (3500, 13500, 4001) with chunks (128, 512, 128), where the last dimension is the “time-series”. We need to rechunk this dataset to (128, 13500, 4001) to be able to write sequentially to a stack of files. With fp32 values, a single such chunk would require ~27 GB of memory PER worker.
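The ~27 GB figure follows directly from the size of one rechunked slab:

```python
# Back-of-the-envelope memory for one rechunked slab of shape (128, 13500, 4001)
from math import prod

slab_shape = (128, 13500, 4001)
itemsize = 4  # fp32
slab_gb = prod(slab_shape) * itemsize / 1e9
print(f"{slab_gb:.1f} GB")  # ~27.7 GB held by a single worker
```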
Proposed Solution?
We are experimenting with writing out slices within each chunk (locally, instead of flattening the whole array) in the desired order, and then combining them incrementally in subsequent graph levels as they complete. However, this is not a trivial task: we need to create sub-layers within each block that can be combined with the next layer of the graph.
See image below for our proposed solution’s pseudo-graph with given configuration:
Array shape = (6, 4, 10), chunk sizes = (3, 2, 10). We serialize along the last axis, so you can ignore it.
Which makes these (spatial) blocks:
- (0, 0) is (slice(0, 3), slice(0, 2))
- (0, 1) is (slice(0, 3), slice(2, 4))
- (1, 0) is (slice(3, 6), slice(0, 2))
- (1, 1) is (slice(3, 6), slice(2, 4))
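The block indices and slices above can be derived mechanically from the shape and chunk sizes; a small helper (hypothetical, for illustration only) that does this:

```python
from itertools import product

def block_slices(shape, chunks):
    """Map spatial block index -> tuple of slices, given shape and chunk sizes."""
    per_axis = [[slice(start, min(start + c, n)) for start in range(0, n, c)]
                for n, c in zip(shape, chunks)]
    indices = product(*(range(len(edges)) for edges in per_axis))
    slices = product(*per_axis)  # iterates in the same order as `indices`
    return dict(zip(indices, slices))

# Spatial dims only; the serialized last axis is ignored as described above.
blocks = block_slices((6, 4), (3, 2))
# blocks[(0, 1)] == (slice(0, 3), slice(2, 4))
```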
For each block we serialize along axis=0, giving 3 slices per block because the chunk size along that axis is 3. Each slice gets serialized independently.
Then we concatenate them on disk once a whole slice across all blocks has been written.
Finally, when they are all written, we concatenate all the files.
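The write/merge order described above can be sketched with index tuples standing in for serialized traces (pure illustration under the toy (6, 4)/(3, 2) configuration, not our actual implementation):

```python
shape, chunks = (6, 4), (3, 2)

# Step 1: each block independently writes one partial file per axis-0 row it owns
partials = {}  # (row, block_col) -> ordered traces for that row segment
for bi in range(shape[0] // chunks[0]):
    for bj in range(shape[1] // chunks[1]):
        for r in range(bi * chunks[0], (bi + 1) * chunks[0]):
            partials[(r, bj)] = [(r, j)
                                 for j in range(bj * chunks[1], (bj + 1) * chunks[1])]

# Step 2: once a whole row is available across all block columns, concat it
rows = [sum((partials[(r, bj)] for bj in range(shape[1] // chunks[1])), [])
        for r in range(shape[0])]

# Step 3: concat all rows; the result is exactly inline-sorted (row-major) order
final = sum(rows, [])
assert final == [(i, j) for i in range(shape[0]) for j in range(shape[1])]
```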
What do you think would be a good solution to this? Are we on the right track?
Visual Examples
Some visual explanations:
First, these assumptions:
- Below is the chunk layout of the 3D data (ignore sample axis)
- Each chunk holds 5x5 traces, i.e. 25 traces.
- A “block” is a complete set of crosslines for an inline chunk, and vice versa for crossline blocks.
If we look at the four chunks in the bottom left corner, they will look like this.
For now, let’s assume this is the whole seismic cube.
However, we want to write them out like this, inline sorted:
The challenge is that if we write each chunk individually and then concatenate the files, the trace order gets scrambled. It would look something like this:
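To make the scrambling concrete, here is the order a naive chunk-by-chunk concatenation would produce for the toy (6, 4)/(3, 2) example above (index tuples stand in for traces):

```python
shape, chunks = (6, 4), (3, 2)

# Naive order: finish one whole block, then move to the next block
naive = []
for bi in range(shape[0] // chunks[0]):
    for bj in range(shape[1] // chunks[1]):
        for i in range(bi * chunks[0], (bi + 1) * chunks[0]):
            for j in range(bj * chunks[1], (bj + 1) * chunks[1]):
                naive.append((i, j))

# Desired inline-sorted (row-major) order over the whole array
expected = [(i, j) for i in range(shape[0]) for j in range(shape[1])]

# naive[:4] == [(0, 0), (0, 1), (1, 0), (1, 1)] -- rows interleave within a block
assert naive != expected
```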