Confused about working with sparse arrays

I'm trying to work with sparse arrays and Dask. My goals are to:

  1. construct a sparse array incrementally, row-wise, without blowing my memory budget (i.e. writing chunks of rows to disk as I go).
  2. read that format into Dask
  3. perform a sparse matrix-vector multiplication with Dask (also without blowing my memory or time budget)

I’m a bit lost about how to accomplish these goals.

An old SO comment by @mrocklin from 2019 about loading sparse arrays with Dask says:

> Ideally you would use a storage technology that allowed for sparse storage, or at least random access. TileDB is an interesting choice today.
>
> Today I would probably store data in HDF5 or Zarr with some appropriate compression to hide the cost of the zeros, then call
>
> ```python
> x = da.from_array(storage, chunks='auto')
> x = x.map_blocks(sparse.COO)
> ```
>
> Just as you have above, but where storage is something that provides random access from disk.

But there is an open issue about sparse TileDB integration, so I assume that's not possible.

For HDF5 and Zarr, there doesn't appear to be a canonical format for saving sparse arrays. If that's the case, what kind of object is `storage` supposed to be in the code above?

In this old GitHub thread, the comments discuss how convenient it would be if SciPy had an ndarray-API-compatible sparse class. That's now a reality, I think. But the Dask docs on sparse arrays use the separate `sparse` package instead. Why?

I'm very confused by all of this. Sorry if these questions are obvious; any help is appreciated!

Hi @deklanw, welcome to Dask Discourse!

Sorry, I don't know much about sparse arrays, but I can try to explain what I understand from your post.

> For HDF5 and Zarr, there doesn't appear to be a canonical format for saving sparse arrays. If that's the case, what kind of object is `storage` supposed to be in the code above?

You are right, there isn't one. But as @mrocklin says in his comment, you can still use them: chunk compression means that chunks full of zeros take up almost no space on disk. So `storage` is simply a Zarr or HDF5 array on disk.

Maybe the docs just haven't been updated, but if you know of a SciPy sparse array implementation with a compatible API, you should be able to use it just like the `sparse` module is used in the documentation.
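For reference, the ndarray-style classes you mention are SciPy's sparse *arrays* (`csr_array`, `coo_array`, etc.), which follow NumPy semantics (`*` is elementwise, `@` is matmul) unlike the older `matrix` classes. I haven't verified how well they integrate with Dask chunks, but standalone they look like this:

```python
import numpy as np
from scipy import sparse as sp

# csr_array is one of SciPy's newer array-API sparse classes.
a = sp.csr_array(np.array([[1.0, 0.0],
                           [0.0, 2.0]]))
v = np.array([3.0, 4.0])

mv = a @ v          # matrix-vector product, returns a dense ndarray
```

Note they are 2-D only, whereas `sparse.COO` supports arbitrary dimensions, which may be one reason the Dask docs use the `sparse` package.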