Speeding up (indexed) column operations?

Hello dask community,

I am working in a setting where it’s not uncommon to back-end to sparse matrices to save memory. Dask has generally allowed me to avoid doing this by offloading blocks to the filesystem instead. However I’m hitting a point where that’s starting to be a pain point, so I’m switching over to a sparse.COO backend for my blocks.

In the dense world, I have been using matrix products to perform basic tabulation tasks. For instance, identifying class proportions (Y as an (n, c) indicator matrix) associated with features X (a (n,k) data matrix) below their mean value might look like

a = X.mean(axis=0)
counts = Y.T.dot(X < a)

The X < a becomes a problem in the sparse world as it is dense. My drop-in replacement is

counts = np.vstack([Y.T.dot( (X[:, i] < a[i]).compute().todense()) ) 
                    for i in X.shape[1]]).T

I’d like to use da.apply_along_axis but since a[i] is changing with each column there’s an index. And I can’t use X - a since that becomes dense too!

Is there a clever way to avoid the list generation + naive iteration?

Thanks

Hi @chartl ,

Welcome!

I’m sorry, I do not fully understand your question.

I do understand that when X is a dask array with sparse.COO blocks, then a = X.mean(axis=0) would have a high density, but I would rather expect X < a to have the same density as X . Is this not the case?

Or is the problem that you’re unsatisfied with the speed of the code below (assuming Y also has sparse.COO blocks)?

a = X.mean(axis=0)
counts = Y.T.dot(X < a).compute()

Sparse matrix libraries (sparse being no exception) tend to disallow operations that could induce a dense array. Boolean operations induce dense matrices (since, for positive a, X < a is True almost everywhere).

Try:

Xsp = sparse.COO.from_numpy(np.random.binomial(0.1, 1, size=(100,10)))
L = np.arange(200).reshape((2,100))
R = np.arange(20).reshape((10,2))
(Xsp < 1).dot(R)
L.dot(Xsp < 1)

Note that the first dot product does work for Xsp > 0. At any rate, these operations give errors on a sparse array, so I’m left with columnwise operations, as above, in the form of a list comprehension.

The question is: is there a better alternative to the list comprehension approach, that doesn’t give the errors from this toy code?

Thanks @chartl.

That’s all very clear now.

Sorry, this is my first time using the sparse library and earlier I had created a small (2 by 2) sparse matrix for the operation and that error had not occurred, presumably because it was too small.

You could try the following code snippet. Feel free to ask for clarification if you need it. A very similar example can be found at this link.

import numpy as np
import dask.array as da
import sparse

from functools import reduce


def _dot_column(y, x, a):
    return np.vstack([
        y.T.dot((x[:, i] < a[i]).todense())
        for i in range(x.shape[1])
    ]).T

    
def _sum(x, axis=None, keepdims=None):
    return reduce(np.add, x)


X_dense = da.from_array(np.random.binomial(5, 0.1, size=(100, 10)))
X = X_dense.map_blocks(sparse.COO)
a = X.mean(axis=0)
Y = X.copy()

batch_sz = 2   # tunable parallelism parameter

X = X.rechunk(chunks=(X.chunks[0], batch_sz))
a = a.rechunk(chunks=batch_sz)

counts = da.core.blockwise(
    *(_dot_column, 'ikj'),
    *(Y, 'ik'),
    *(X, 'ij'),
    *(a, 'j'),
    adjust_chunks={'i': 1},
    dtype=Y.dtype,
    meta=np.array([]),
)
counts = da.reduction(
    counts,
    lambda x, axis, keepdims: x,
    _sum,
    axis=0,
    concatenate=False,
    dtype=Y.dtype,
    meta=sparse.COO
)
counts.compute()
3 Likes

I think this makes sense. Just for verification on this syntax, would:

y.dot((x[:,i] < a[i]).todense())  # no transpose on y here
[...]

counts = da.core.blockwise(
  *(_dot_column, 'ijk'),
  *(Y, 'ki'),  # index inverted here
   ...
)

be equivalent to the above?

Yes. I agree that is correct, as long as the number of columns of Y matches the number of rows of X.