Hi @blackcupcat
Unfortunately, I’m not able to reproduce the error you got.
See the example results I obtained in Jupyter below.
As you can see, things seem to work fine for me:
In [1]:
import numpy as np
import dask
import dask.array as da
from scipy.sparse import csr_matrix
nrows, ncols = 80000, 138106
# start from a square identity matrix large enough for both dimensions;
# it gets sliced down to (nrows, ncols) in the next cell
x = da.eye(max(nrows, ncols))
In [2]:
%%time
# convert every chunk to a scipy CSR matrix and keep the result in memory
da_x = x[:nrows, :ncols].map_blocks(csr_matrix).persist()
repr(da_x)
Out [2]:
CPU times: total: 3min 38s
Wall time: 33.8 s
'dask.array<csr_matrix, shape=(80000, 138106), dtype=float64,
chunksize=(4096, 4096), chunktype=scipy.csr_matrix>'
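If you want to double-check that the conversion really happened on your end, dask’s
`.blocks` accessor lets you pull out a single chunk and inspect it (a quick
sanity check, not part of the timed run; output omitted):
type(da_x.blocks[0, 0].compute())  # should report scipy’s csr_matrix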
In [3]:
np_features = np.random.random(ncols).astype(np.bool_) # numpy array
list_features = np_features.tolist() # python list
da_features = da.from_array(np_features) # dask array
repr(da_features)
Out [3]:
'dask.array<array, shape=(138106,), dtype=bool, chunksize=(138106,),
chunktype=numpy.ndarray>'
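One caveat on the mask itself: `np.random.random` draws floats in [0, 1), and
casting those to `np.bool_` marks every nonzero value as True, so this mask
selects essentially all 138106 columns. If you want a roughly 50% selection
instead, a simple threshold does it (note this may change the timings below):
np_features = np.random.random(ncols) < 0.5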
In [4]:
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
sel_da = da_x[:, da_features].compute()
Out [4]:
CPU times: total: 719 ms
Wall time: 617 ms
In [5]:
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
sel_np = da_x[:, np_features].compute()
Out [5]:
CPU times: total: 13.2 s
Wall time: 12.2 s
In [6]:
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
sel_list = da_x[:, list_features].compute()
Out [6]:
CPU times: total: 8.44 s
Wall time: 7.95 s
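For context: as I understand it, `array.slicing.split_large_chunks` tells dask
to split up the oversized output chunks that this kind of boolean indexing can
otherwise produce, rather than emitting a PerformanceWarning. If you’d rather
not repeat the context manager in every cell, the same option can be set once
for the whole session (a minimal sketch, using the same key as above):
dask.config.set({'array.slicing.split_large_chunks': True})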
In [7]:
# a zero difference between sparse matrices has no stored entries,
# so .nnz == 0 confirms all three selections are identical
assert (sel_da - sel_np).nnz == 0
assert (sel_da - sel_list).nnz == 0
For comparison, I’ve included my Python version below:
> python --version
Python 3.10.4
and a list of the relevant packages currently in my conda
environment:
> conda list "dask|scipy|numpy"
# packages in environment at C:\Users\heamu\anaconda3\envs\daskenv:
#
# Name Version Build Channel
dask 2022.5.0 pyhd8ed1ab_0 conda-forge
dask-core 2022.5.0 pyhd8ed1ab_0 conda-forge
dask-image 2021.12.0 pyhd8ed1ab_0 conda-forge
numpy 1.22.3 py310hed7ac4c_2 conda-forge
scipy 1.8.0 py310h33db832_1 conda-forge