Converting scipy sparse csr_matrix to dask array

blackcupcat · April 27, 2022, 11:46pm

We are getting OOM error on scipy sparse csr matrix on slicing the matrix.
So to avoid OOM we are planning on converting this into a dask_array.

We have tried converting it this way : dask.from_array(csr_matrix) but eventually we are getting indexing issues.
Is there another way of converting it ?

ParticularMiner · April 28, 2022, 6:55am

Welcome @blackcupcat !

I would use map_blocks() for the conversion, as suggested in the docs: follow this link.

However you need to be aware that scipy.sparse is not fully supported since its API deviates slightly from numpy’s API on which dask.array is based (see same link above). But as far as array conversion, indexing and slicing are concerned, things should work fine. You may also consider using the sparse (instead of scipy.sparse) package which seems to be fully supported.

import dask.array as da
from scipy.sparse import csr_matrix


x = da.random.random((100, 100), chunks=(10, 10))
x[x < 0.5] = 0
s = x.map_blocks(csr_matrix)
s[5:15, 0:10].compute().toarray()

blackcupcat · May 13, 2022, 1:33am

Hi @ParticularMiner thank you for sharing the documentation and the snippet !

The mentioned snippet seems to be useful for converting the dask chunk to csr_matrix.

But my use-case is slightly different.
We already have a csr_matrix → X of the dimension [8000000, 138106]
We want to convert this to a dask array.

import dask.array as da

print(type(X))
<class 'scipy.sparse.csr.csr_matrix'>

# convert to dask array via some method, here as an example I have used from_array
Y = da.from_array(X)

# Do some operations on Y, an example below: 
Y_op = Y[:, 2:1000]

# Convert back to csr, here I think I can use the map_blocks that you have suggested
Y_op.map_blocks(csr_matrix)

Here the question is about the second line, where we are trying to convert csr_matrix to dask array, is from_array the correct method to use or there are some other methods available ?

ParticularMiner · May 13, 2022, 7:01am

I apologize for the misunderstanding @blackcupcat — I failed to recognize that csr_matrix in your post was not referring to the class in scipy.sparse, but was rather a variable.

In that case, yes: using Y = dask.array.from_array(X), as in your code-snippet is sufficient to do what you want.

But note that Y is then a dask array whose chunk-type is already <class 'scipy.sparse.csr.csr_matrix'>. So is Y_op.

So to convert Y_op back to <class 'scipy.sparse.csr.csr_matrix'> proper, you only need to call Y_op.compute().

The call Y_op.map_blocks(scipy.sparse.csr_matrix) is intended to change the chunk-type of Y_op to <class 'scipy.sparse.csr.csr_matrix'>. But since the chunk-type of Y_op is already <class 'scipy.sparse.csr.csr_matrix'> before the call, then this call does nothing. So it is not necessary.

I hope this answers your question. Feel free to inquire further if not.

blackcupcat · May 17, 2022, 11:19pm

Thanks @ParticularMiner for getting back.
The code as per your suggestion.

import dask.array as da

#features is a boolean array of shape (138106,) 
features = [True False True False .....................True]

Y = da.from_array(X)[:, features]
Y_converted = Y.compute()

But this method call (Y.compute()) is throwing this error - IndexError: too many indices for array


  File "/opt/amazon/lib/python3.6/site-packages/dask/base.py", line 167, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/opt/amazon/lib/python3.6/site-packages/dask/base.py", line 447, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/opt/amazon/lib/python3.6/site-packages/dask/threaded.py", line 84, in get
    **kwargs
  File "/opt/amazon/lib/python3.6/site-packages/dask/local.py", line 486, in get_async
    raise_exception(exc, tb)
  File "/opt/amazon/lib/python3.6/site-packages/dask/local.py", line 316, in reraise
    raise exc
  File "/opt/amazon/lib/python3.6/site-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/opt/amazon/lib/python3.6/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
IndexError: too many indices for array

ParticularMiner · May 18, 2022, 8:13am

Hi @blackcupcat

Unfortunately, I’m not able to reproduce the error you got.

See example-results I obtained in jupyter below. As you can see, things seem to work well for me:

In [1]:

import numpy as np
import dask
import dask.array as da
from scipy.sparse import csr_matrix, eye


nrows, ncols = 80000, 138106
x = da.eye(max(nrows, ncols))

In [2]:

%%time
da_x = x[:nrows, :ncols].map_blocks(csr_matrix).persist()
da_x.__repr__()

Out [2]:

CPU times: total: 3min 38s
Wall time: 33.8 s

'dask.array<csr_matrix, shape=(80000, 138106), dtype=float64, 
 chunksize=(4096, 4096), chunktype=scipy.csr_matrix>'

In [3]:

np_features = np.random.random(ncols).astype(np.bool_)  # numpy array
list_features = np_features.tolist()  # python list
da_features = da.from_array(np_features)  # dask array
da_features.__repr__()

Out [3]:

'dask.array<array, shape=(138106,), dtype=bool, chunksize=(138106,),
 chunktype=numpy.ndarray>'

In [4]:

%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    sel_da = da_x[:, da_features].compute()

Out [4]:

CPU times: total: 719 ms
Wall time: 617 ms

In [5]:

%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    sel_np = da_x[:, np_features].compute()

Out [5]:

CPU times: total: 13.2 s
Wall time: 12.2 s

In [6]:

%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    sel_list = da_x[:, list_features].compute()

Out [6]:

CPU times: total: 8.44 s
Wall time: 7.95 s

In [7]:

assert (sel_da - sel_np).nnz == 0
assert (sel_da - sel_list).nnz == 0

For your comparison, I’ve included below my python version:

> python --version
Python 3.10.4

and a list of the relevant packages currently in my conda environment:

> conda list "dask|scipy|numpy"
# packages in environment at C:\Users\heamu\anaconda3\envs\daskenv:
#
# Name                    Version                   Build  Channel
dask                      2022.5.0           pyhd8ed1ab_0    conda-forge
dask-core                 2022.5.0           pyhd8ed1ab_0    conda-forge
dask-image                2021.12.0          pyhd8ed1ab_0    conda-forge
numpy                     1.22.3          py310hed7ac4c_2    conda-forge
scipy                     1.8.0           py310h33db832_1    conda-forge

Topic		Replies	Views
Speeding up (indexed) column operations? Dask Array	5	367	March 29, 2022
Confused about working with sparse arrays Dask Array dask-array , sparse	1	721	April 12, 2023
Dataframe from sparse array Dask DataFrame	0	445	August 18, 2022
Most efficient way to copy from Dask array to Numpy Dask Array dask-array	2	44	December 4, 2024
Dask.array.reduction instead of da.map_blocks? Dask Array dask-array	2	126	June 7, 2024

Converting scipy sparse csr_matrix to dask array

Related topics