Hi @blackcupcat
Unfortunately, I’m not able to reproduce the error you got.
See the example results I obtained in Jupyter below.
As you can see, things seem to work fine for me:
In [1]:
import numpy as np
import dask
import dask.array as da
from scipy.sparse import csr_matrix
nrows, ncols = 80000, 138106
# start from a square identity matrix large enough for both dimensions;
# it gets sliced down to (nrows, ncols) in the next cell
x = da.eye(max(nrows, ncols))
In [2]:
%%time
# convert every chunk to a scipy CSR matrix and keep the result in memory
da_x = x[:nrows, :ncols].map_blocks(csr_matrix).persist()
repr(da_x)
Out [2]:
CPU times: total: 3min 38s
Wall time: 33.8 s
'dask.array<csr_matrix, shape=(80000, 138106), dtype=float64,
chunksize=(4096, 4096), chunktype=scipy.csr_matrix>'
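If you want to double-check that the conversion really happened on your end, dask’s
`.blocks` accessor lets you pull out a single chunk and inspect it (a quick
sanity check, not part of the timed run; output omitted):
type(da_x.blocks[0, 0].compute())  # should report scipy’s csr_matrix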
In [3]:
np_features = np.random.random(ncols).astype(np.bool_) # numpy array
list_features = np_features.tolist() # python list
da_features = da.from_array(np_features) # dask array
repr(da_features)
Out [3]:
'dask.array<array, shape=(138106,), dtype=bool, chunksize=(138106,),
chunktype=numpy.ndarray>'
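One caveat on the mask itself: `np.random.random` draws floats in [0, 1), and
casting those to `np.bool_` marks every nonzero value as True, so this mask
selects essentially all 138106 columns. If you want a roughly 50% selection
instead, a simple threshold does it (note this may change the timings below):
np_features = np.random.random(ncols) < 0.5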
In [4]:
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
sel_da = da_x[:, da_features].compute()
Out [4]:
CPU times: total: 719 ms
Wall time: 617 ms
In [5]:
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
sel_np = da_x[:, np_features].compute()
Out [5]:
CPU times: total: 13.2 s
Wall time: 12.2 s
In [6]:
%%time
with dask.config.set(**{'array.slicing.split_large_chunks': True}):
sel_list = da_x[:, list_features].compute()
Out [6]:
CPU times: total: 8.44 s
Wall time: 7.95 s
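For context: as I understand it, `array.slicing.split_large_chunks` tells dask
to split up the oversized output chunks that this kind of boolean indexing can
otherwise produce, rather than emitting a PerformanceWarning. If you’d rather
not repeat the context manager in every cell, the same option can be set once
for the whole session (a minimal sketch, using the same key as above):
dask.config.set({'array.slicing.split_large_chunks': True})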
In [7]:
# a zero difference between sparse matrices has no stored entries,
# so .nnz == 0 confirms all three selections are identical
assert (sel_da - sel_np).nnz == 0
assert (sel_da - sel_list).nnz == 0
For comparison, I’ve included my Python version below:
> python --version
Python 3.10.4
and a list of the relevant packages currently in my conda
environment:
> conda list "dask|scipy|numpy"
# packages in environment at C:\Users\heamu\anaconda3\envs\daskenv:
#
# Name Version Build Channel
dask 2022.5.0 pyhd8ed1ab_0 conda-forge
dask-core 2022.5.0 pyhd8ed1ab_0 conda-forge
dask-image 2021.12.0 pyhd8ed1ab_0 conda-forge
numpy 1.22.3 py310hed7ac4c_2 conda-forge
scipy 1.8.0 py310h33db832_1 conda-forge