Speeding up (indexed) column operations?

Thanks @chartl.

That’s all very clear now.

Sorry, this is my first time using the sparse library. Earlier I had created a small (2 by 2) sparse matrix for the operation, and the error had not occurred, presumably because the matrix was too small.

You could try the following code snippet. The idea is to compare each column chunk of X against the corresponding chunk of the column means a inside da.blockwise, dot the resulting indicator into Y.T one block at a time, and then sum the per-block partials with da.reduction. Feel free to ask for clarification if you need it. A very similar example can be found at this link.

import numpy as np
import dask.array as da
import sparse

from functools import reduce


def _dot_column(y, x, a):
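    # For each column of the sparse block x, densify the boolean
    # indicator (x[:, col] < a[col]) and take its dot product with y.T,
    # then stack the per-column results into one partial block.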
    return np.vstack([
        y.T.dot((x[:, i] < a[i]).todense())
        for i in range(x.shape[1])
    ]).T

    
def _sum(x, axis=None, keepdims=None):
    # Aggregate step for da.reduction: x arrives as a sequence of
    # per-block partial results, which are summed elementwise.
    return reduce(np.add, x)


# Small random test problem: a dense Dask array converted block-wise to
# sparse COO.
X_dense = da.from_array(np.random.binomial(5, 0.1, size=(100, 10)))
X = X_dense.map_blocks(sparse.COO)
a = X.mean(axis=0)
Y = X.copy()

batch_sz = 2   # tunable parallelism parameter

# blockwise requires the chunks of a to line up with the column chunks
# of X along the shared index.
X = X.rechunk(chunks=(X.chunks[0], batch_sz))
a = a.rechunk(chunks=batch_sz)

# Compute, block by block, the partial products Y_block.T @ (X_block < a_block);
# adjust_chunks={'i': 1} marks the row axis as reduced within each block
# so the partials can be summed over the row chunks afterwards.
counts = da.blockwise(
    _dot_column, 'ikj',
    Y, 'ik',
    X, 'ij',
    a, 'j',
    adjust_chunks={'i': 1},
    dtype=Y.dtype,
    meta=np.array([]),
)
# Sum the per-block partials along the row-chunk axis. The chunk
# function is the identity; _sum does the actual elementwise addition.
counts = da.reduction(
    counts,
    lambda x, axis, keepdims: x,   # per-chunk step: leave blocks as-is
    _sum,                          # aggregate: elementwise sum of blocks
    axis=0,
    concatenate=False,
    dtype=Y.dtype,
    meta=sparse.COO,
)
counts.compute()
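
As a quick sanity check (my own sketch, not part of the snippet above; it assumes the data is small enough to densify), the result should agree with the plain dense computation X.T @ (X < a):

# Hypothetical verification, assuming X fits in memory as a dense array.
X_np = X_dense.compute()
mask = (X_np < X_np.mean(axis=0)).astype(X_np.dtype)
expected = X_np.T @ mask

result = counts.compute()
if hasattr(result, "todense"):   # the result may come back sparse or dense
    result = result.todense()
assert np.allclose(result, expected)

batch_sz is the knob to experiment with: it sets how many columns each _dot_column call processes, trading per-task overhead against parallelism.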