Confused about working with sparse arrays

I'm trying to work with sparse arrays and Dask. My goals are to:

  1. construct a sparse array incrementally, row-wise, without blowing my memory budget (i.e. writing chunks of rows to disk as I go).
  2. read that format into Dask
  3. perform a sparse matrix-vector multiplication with Dask (also without blowing my memory or time budget)

I’m a bit lost about how to accomplish these goals.

An old SO comment by @mrocklin from 2019 about loading sparse arrays with Dask says:

> Ideally you would use a storage technology that allowed for sparse storage, or at least random access. TileDB is an interesting choice today.
>
> Today I would probably store data in HDF5 or Zarr with some appropriate compression to hide the cost of the zeros, then call
>
> ```python
> x = da.from_array(storage, chunks='auto')
> x = x.map_blocks(sparse.COO)
> ```
>
> Just as you have above, but where storage is something that provides random access from disk.

But there is an open issue about sparse TileDB integration, so I assume that's not possible.

For HDF5 and Zarr, there doesn't appear to be a canonical format for saving sparse arrays. If that's the case, what kind of object is `storage` supposed to be in the code above?

In this old GitHub thread, the comments discuss how convenient it would be if SciPy had an ndarray-API-compatible sparse class. That's now a reality, I think. But the Dask docs on sparse arrays use the separate `sparse` package instead. Why?

I'm very confused by all of this. Sorry if these questions are obvious; any help is appreciated!

Hi @deklanw, welcome to Dask Discourse!

Sorry, I don't know much about sparse arrays, but I can try to explain what I understand from your post.

> For HDF5 and Zarr, there doesn't appear to be a canonical format for saving sparse arrays. If that's the case, what kind of object is `storage` supposed to be in the code above?

You are right, there isn't one. But as @mrocklin says in his comment, you can still use them: chunk compression means that chunks full of zeros take up almost no space on disk. So `storage` is simply a Zarr or HDF5 array on disk.

Maybe the docs just haven't been updated, but if you know of a SciPy sparse array implementation with a compatible API, you should be able to use it just like the `sparse` module is used in the documentation.
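For reference, the ndarray-style classes you mention are SciPy's sparse *arrays* (`csr_array`, `coo_array`, etc.), which follow NumPy semantics (`*` is elementwise, `@` is matmul) unlike the older `matrix` classes. I haven't verified how well they integrate with Dask chunks, but standalone they look like this:

```python
import numpy as np
from scipy import sparse as sp

# csr_array is one of SciPy's newer array-API sparse classes.
a = sp.csr_array(np.array([[1.0, 0.0],
                           [0.0, 2.0]]))
v = np.array([3.0, 4.0])

mv = a @ v          # matrix-vector product, returns a dense ndarray
```

Note they are 2-D only, whereas `sparse.COO` supports arbitrary dimensions, which may be one reason the Dask docs use the `sparse` package.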