I'm trying to work with sparse arrays and Dask. My goals are to:
- construct a sparse array incrementally, row-wise, writing chunks of rows to disk as I go, without blowing my memory budget
- read that on-disk format back into Dask
- perform a sparse matrix-vector multiplication with Dask (again without blowing my memory or time budget)
I’m a bit lost about how to accomplish these goals.
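For concreteness, here is a rough sketch of the first step as I currently imagine it. All of the names, shapes, chunk sizes, and the one-`.npz`-file-per-chunk format below are placeholders I made up, and part of my question is whether this is even a sensible approach:

```python
import scipy.sparse as sp

# made-up dimensions; the real matrix is too large to build in memory at once
n_rows, n_cols, rows_per_chunk = 1_000_000, 50_000, 10_000

for i in range(n_rows // rows_per_chunk):
    # stand-in for my real row-wise construction: build one block of rows at a
    # time so the full matrix never has to live in RAM, then flush it to disk
    chunk = sp.random(rows_per_chunk, n_cols, density=1e-4, format="csr")
    sp.save_npz(f"chunk_{i:04d}.npz", chunk)

# Later I want to read all of these chunks back as one Dask array X and
# compute X @ v for a dense vector v of length n_cols.
```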
An old SO comment by @mrocklin from 2019 about loading sparse arrays with Dask says:

> Ideally you would use a storage technology that allowed for sparse storage, or at least random access. TileDB is an interesting choice today.
>
> Today I would probably store data in HDF5 or Zarr with some appropriate compression to hide the cost of the zeros, then call
>
> ```python
> x = da.from_array(storage, chunks='auto')
> x = x.map_blocks(sparse.COO)
> ```
>
> Just as you have above, but where `storage` is something that provides random access from disk.
But there is an open issue about sparse TileDB integration, so I assume that isn't currently possible.
For HDF5 and Zarr, there doesn't appear to be a canonical format for saving sparse arrays. If that's the case, what kind of object is `storage` supposed to be in the code above?
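To make my confusion concrete, here is my best guess at what that recipe looks like end to end, with `storage` being a Zarr array of the (mostly zero) dense data. The store path, shapes, and chunk sizes are placeholders, and I don't know whether this is the intended pattern, or whether the final product even works with sparse chunks:

```python
import dask.array as da
import numpy as np
import sparse
import zarr

# write a mostly-zero dense array to Zarr; compression should hide the zeros
z = zarr.open("matrix.zarr", mode="w", shape=(100_000, 1_000),
              chunks=(10_000, 1_000), dtype="float64")
z[0, :10] = np.arange(10)  # a handful of nonzeros

# read it back lazily, converting each chunk to sparse.COO in memory
x = da.from_array(z, chunks=(10_000, 1_000))
x = x.map_blocks(sparse.COO)

# the sparse matrix-vector product I ultimately need
v = np.ones(1_000)
y = x.dot(v).compute()
```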
In this old GitHub thread there are comments about how convenient it would be if SciPy had an ndarray-API-compatible sparse class. As far as I can tell that now exists (the scipy.sparse array classes such as coo_array). But the Dask documentation on sparse arrays uses the separate sparse (pydata/sparse) package instead. Why?
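In case it helps clarify what I mean, this is the distinction I'm asking about (a trivial example; I may be misunderstanding how the two libraries relate):

```python
import numpy as np
import scipy.sparse as sp
import sparse

dense = np.eye(3)

a = sp.coo_array(dense)  # SciPy's newer ndarray-style sparse array class
b = sparse.COO(dense)    # the chunk type the Dask docs use

print(type(a), type(b))  # two different sparse classes from two different libraries
```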
I'm very confused by all of this. Sorry if these questions are obvious. Any help is appreciated!