Speeding up (indexed) column operations?

chartl · March 28, 2022, 8:32pm

Hello dask community,

I am working in a setting where it’s not uncommon to back-end to sparse matrices to save memory. Dask has generally allowed me to avoid doing this by offloading blocks to the filesystem instead. However I’m hitting a point where that’s starting to be a pain point, so I’m switching over to a sparse.COO backend for my blocks.

In the dense world, I have been using matrix products to perform basic tabulation tasks. For instance, identifying class proportions (Y as an (n, c) indicator matrix) associated with features X (a (n,k) data matrix) below their mean value might look like

a = X.mean(axis=0)
counts = Y.T.dot(X < a)

The X < a becomes a problem in the sparse world as it is dense. My drop-in replacement is

counts = np.vstack([Y.T.dot( (X[:, i] < a[i]).compute().todense()) ) 
                    for i in X.shape[1]]).T

I’d like to use da.apply_along_axis but since a[i] is changing with each column there’s an index. And I can’t use X - a since that becomes dense too!

Is there a clever way to avoid the list generation + naive iteration?

Thanks

ParticularMiner · March 29, 2022, 11:54am

Hi @chartl ,

Welcome!

I’m sorry, I do not fully understand your question.

I do understand that when X is a dask array with sparse.COO blocks, then a = X.mean(axis=0) would have a high density, but I would rather expect X < a to have the same density as X . Is this not the case?

Or is the problem that you’re unsatisfied with the speed of the code below (assuming Y also has sparse.COO blocks)?

a = X.mean(axis=0)
counts = Y.T.dot(X < a).compute()

chartl · March 29, 2022, 2:31pm

Sparse matrix libraries (sparse being no exception) tend to disallow operations that could induce a dense array. Boolean operations induce dense matrices (since, for positive a, X < a is True almost everywhere).

Try:

Xsp = sparse.COO.from_numpy(np.random.binomial(0.1, 1, size=(100,10)))
L = np.arange(200).reshape((2,100))
R = np.arange(20).reshape((10,2))
(Xsp < 1).dot(R)
L.dot(Xsp < 1)

Note that the first dot product does work for Xsp > 0. At any rate, these operations give errors on a sparse array, so I’m left with columnwise operations, as above, in the form of a list comprehension.

The question is: is there a better alternative to the list comprehension approach, that doesn’t give the errors from this toy code?

ParticularMiner · March 29, 2022, 5:33pm

Thanks @chartl.

That’s all very clear now.

Sorry, this is my first time using the sparse library and earlier I had created a small (2 by 2) sparse matrix for the operation and that error had not occurred, presumably because it was too small.

You could try the following code snippet. Feel free to ask for clarification if you need it. A very similar example can be found at this link.

import numpy as np
import dask.array as da
import sparse

from functools import reduce


def _dot_column(y, x, a):
    return np.vstack([
        y.T.dot((x[:, i] < a[i]).todense())
        for i in range(x.shape[1])
    ]).T

    
def _sum(x, axis=None, keepdims=None):
    return reduce(np.add, x)


X_dense = da.from_array(np.random.binomial(5, 0.1, size=(100, 10)))
X = X_dense.map_blocks(sparse.COO)
a = X.mean(axis=0)
Y = X.copy()

batch_sz = 2   # tunable parallelism parameter

X = X.rechunk(chunks=(X.chunks[0], batch_sz))
a = a.rechunk(chunks=batch_sz)

counts = da.core.blockwise(
    *(_dot_column, 'ikj'),
    *(Y, 'ik'),
    *(X, 'ij'),
    *(a, 'j'),
    adjust_chunks={'i': 1},
    dtype=Y.dtype,
    meta=np.array([]),
)
counts = da.reduction(
    counts,
    lambda x, axis, keepdims: x,
    _sum,
    axis=0,
    concatenate=False,
    dtype=Y.dtype,
    meta=sparse.COO
)
counts.compute()

chartl · March 29, 2022, 5:45pm

I think this makes sense. Just for verification on this syntax, would:

y.dot((x[:,i] < a[i]).todense())  # no transpose on y here
[...]

counts = da.core.blockwise(
  *(_dot_column, 'ijk'),
  *(Y, 'ki'),  # index inverted here
   ...
)

be equivalent to the above?

ParticularMiner · March 29, 2022, 5:49pm

Yes. I agree that is correct, as long as the number of columns of Y matches the number of rows of X.

Topic		Replies	Views
Dask slower than numpy Dask Array	1	365	August 23, 2022
Saving large dask arrays one block at a time, without first persisting in memory Dask Array dask-array , distributed	2	859	April 27, 2023
Dataframe from sparse array Dask DataFrame	0	451	August 18, 2022
Confused about working with sparse arrays Dask Array dask-array , sparse	1	736	April 12, 2023
Indexing a dask array with a boolean array Dask Array dask-array	2	1707	May 19, 2022

Speeding up (indexed) column operations?

Related topics