Dataframe from sparse array

anteuh · August 18, 2022, 9:52am

I want to create a sparse array with dask without ever creating a dense matrix (at least not in memory).
Here is what I’m trying to accomplish:
Example input:

record = {
            0: {
                "item1": 1,
                "item2": 3
            },
            1: {"item1": 2,
                "item3": 1,
                "item4": 50},
        }

Correct (but inefficient) output. The real matrix is huge and very sparse.

import dask.dataframe as dd
import sparse  # used further down for sparse.COO

df = dd.DataFrame.from_dict(record, orient="index", npartitions=10)
df = df.fillna(0).astype(int).compute()
df
   item1  item2  item3  item4
0      1      3      0      0
1      2      0      1     50

Because Dask is lazy, it avoids loading everything into memory at once, but I also want
the data to be stored as sparse arrays.

df = dd.DataFrame.from_dict(record, orient="index", npartitions=10).fillna(0).astype(int)
sparse_array = df.to_dask_array().map_blocks(sparse.COO)
sparse_array.compute() # => <COO: shape=(2, 4), dtype=int64, nnz=5, fill_value=0>

So far so good, now I just want it back into a nice dataframe:

df_sparse = dd.from_dask_array(sparse_array.compute(), columns=df.columns)
# error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "[path]/.venv/lib/python3.10/site-packages/dask/dataframe/io/io.py", line 519, in from_dask_array
    arrays_and_indices = [x.name, "ij" if x.ndim == 2 else "i"]
AttributeError: 'COO' object has no attribute 'name'

Leaving out .compute() makes it work, but only until I compute the result.

df_sparse = dd.from_dask_array(sparse_array, columns=df.columns)
df_sparse
Dask DataFrame Structure:
               item1  item2  item3  item4
npartitions=2                            
               int64  int64  int64  int64
                 ...    ...    ...    ...
                 ...    ...    ...    ...
Dask Name: from-dask-array, 6 graph layers

# however...
df_sparse.compute()
# error:
    raise ValueError("DataFrame constructor not properly called!")
ValueError: DataFrame constructor not properly called!

I guess the latter error is also caused by the missing name?
Some Google searching turned up the meta parameter. I tried some variations of

    df_sparse = dd.from_dask_array(sparse_array.compute(), meta=dd.utils.make_meta(df))

but no dice.

Any suggestions on what I can try?
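
To make the goal concrete, here is a minimal sketch of the sparse structure being asked for, built from the example record without ever materializing the dense matrix. It uses only the standard library. The function name record_to_coo is hypothetical, and the sketch assumes that row keys are non-negative integers and that columns are numbered in first-seen order.

```python
def record_to_coo(record):
    """Convert {row: {column_label: value}} into COO-style triplets.

    Assumption: row keys are non-negative integers; column indices are
    assigned in the order the labels are first encountered.
    """
    columns = {}  # column label -> column index
    rows, cols, data = [], [], []
    for row in sorted(record):
        for label, value in record[row].items():
            col = columns.setdefault(label, len(columns))
            rows.append(row)
            cols.append(col)
            data.append(value)
    n_rows = max(rows) + 1 if rows else 0
    shape = (n_rows, len(columns))
    return rows, cols, data, shape, list(columns)


record = {
    0: {"item1": 1, "item2": 3},
    1: {"item1": 2, "item3": 1, "item4": 50},
}

rows, cols, data, shape, columns = record_to_coo(record)
print(shape, len(data))  # shape and number of stored (non-zero) values
```

Only the five stored values are kept, which matches the nnz=5 reported for the COO array above. Assuming the sparse package is installed, these triplets should be usable as sparse.COO([rows, cols], data, shape=shape); getting from there to a Dask DataFrame is the part the question is still about.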
