I want to create a sparse array with dask without ever creating a dense matrix (at least not in memory).
Here is what I’m trying to accomplish:
Example input:
record = {
    0: {"item1": 1, "item2": 3},
    1: {"item1": 2, "item3": 1, "item4": 50},
}
The following gives the correct output but is inefficient, because the real matrix is huge and very sparse:
import dask.dataframe as dd
import sparse

df = dd.DataFrame.from_dict(record, orient="index", npartitions=10)
df = df.fillna(0).astype(int).compute()
df
   item1  item2  item3  item4
0      1      3      0      0
1      2      0      1     50
Because Dask is lazy, it avoids loading everything into memory at once, but now I also want
the data to be stored as sparse arrays.
df = dd.DataFrame.from_dict(record, orient="index", npartitions=10).fillna(0).astype(int)
sparse_array = df.to_dask_array().map_blocks(sparse.COO)
sparse_array.compute() # => <COO: shape=(2, 4), dtype=int64, nnz=5, fill_value=0>
So far so good. Now I just want to get it back into a nice DataFrame:
df_sparse = dd.from_dask_array(sparse_array.compute(), columns=df.columns)
#error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "[path]/.venv/lib/python3.10/site-packages/dask/dataframe/io/io.py", line 519, in from_dask_array
arrays_and_indices = [x.name, "ij" if x.ndim == 2 else "i"]
AttributeError: 'COO' object has no attribute 'name'
Leaving out .compute() makes it work momentarily.
df_sparse = dd.from_dask_array(sparse_array, columns=df.columns)
df_sparse
Dask DataFrame Structure:
               item1  item2  item3  item4
npartitions=2
               int64  int64  int64  int64
                 ...    ...    ...    ...
                 ...    ...    ...    ...
Dask Name: from-dask-array, 6 graph layers
#however...
df_sparse.compute()
#error
raise ValueError("DataFrame constructor not properly called!")
ValueError: DataFrame constructor not properly called!
I guess the latter is a result of the name missing as well?
Some Google searching turned up the meta parameter. I tried some variations of
df_sparse = dd.from_dask_array(sparse_array.compute(), meta=dd.utils.make_meta(df))
but no dice.
Any suggestions on what I can try?
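
For context, the goal of never materializing a dense matrix can be illustrated independently of dask. The following is a minimal sketch, not a dask-based solution: it walks the nested dict once and collects coordinate (COO) triplets directly. The helper name record_to_coo is hypothetical, and the sketch assumes the outer keys are non-negative integer row indices, as in the example input.

```python
def record_to_coo(record):
    """Turn {row: {column_name: value}} into COO triplets.

    Hypothetical helper: only the non-zero entries are ever stored,
    so no dense matrix is created at any point.
    Assumes the outer keys are non-negative integer row indices.
    """
    columns = {}  # column name -> column position, in first-seen order
    rows, cols, data = [], [], []
    for row, items in record.items():
        for name, value in items.items():
            col = columns.setdefault(name, len(columns))
            rows.append(row)
            cols.append(col)
            data.append(value)
    n_rows = max(record) + 1 if record else 0
    shape = (n_rows, len(columns))
    return rows, cols, data, list(columns), shape


record = {
    0: {"item1": 1, "item2": 3},
    1: {"item1": 2, "item3": 1, "item4": 50},
}
rows, cols, data, columns, shape = record_to_coo(record)

# The triplets could then be handed to a sparse container, for example
# sparse.COO([rows, cols], data, shape=shape), if the sparse package is installed.
```

The triplets hold only the five non-zero entries of the 2 x 4 example, and the column names are kept separately so they can be reattached later. How best to get from there back to a partitioned DataFrame is still the open part of the question.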