This is the same question I posted on StackOverflow.

Right now I'm performing multi-hot encoding in vanilla NumPy, but I'd like to port the code to Dask.
```python
import numpy as np

data = np.array([
    [1, 4, 77, 87, 100, 101, 102, 121],
    [12, 41, 58, 67, 81, 84, 96, 111],
    [31, 33, 35, 50, 60, 70, 92, 99],
])

# Index the identity matrix to get a one-hot vector per category,
# then OR the one-hot vectors of each row together
multihot = np.eye(128, dtype=bool)[data]
multihot = np.logical_or.reduce(multihot, axis=1)
print(multihot.astype(int))
```
The previous block of code produces this output:
```
[[0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
```
- Is there a more efficient and Dask-compatible way to perform this type of conversion? (The number of categories is 128, and each row of `data` is sorted in ascending order.)
- Can I port this procedure to Dask? The problem with this block of code is that Dask does not support slicing with lists on multiple axes. Moreover, `dask.array.Array.vindex` does not support indexing with Dask objects; you first have to call `compute` (e.g. `da.eye(128, dtype=bool).vindex[data.compute()]`), but `data` is a huge Dask array and does not fit in memory.
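For context, here is the kind of chunk-wise port I have in mind: a minimal, untested sketch that applies the NumPy logic per block with `da.map_blocks`, passing explicit output `chunks` because the block function changes the shape. It assumes `data` is not chunked along the category axis (axis 1); the tiny inline array stands in for the real data.

```python
import numpy as np
import dask.array as da

NUM_CATEGORIES = 128

def multihot_block(block):
    # block: (rows, k) chunk of category indices -> (rows, NUM_CATEGORIES) bool
    out = np.zeros((block.shape[0], NUM_CATEGORIES), dtype=bool)
    rows = np.repeat(np.arange(block.shape[0]), block.shape[1])
    out[rows, block.ravel()] = True
    return out

# Stand-in for the real (huge) array; chunked only along axis 0, so each
# block sees all of a row's category indices at once.
data = da.from_array(
    np.array([
        [1, 4, 77, 87, 100, 101, 102, 121],
        [12, 41, 58, 67, 81, 84, 96, 111],
        [31, 33, 35, 50, 60, 70, 92, 99],
    ]),
    chunks=(1, 8),
)

multihot = data.map_blocks(
    multihot_block,
    dtype=bool,
    # output chunk sizes: same row chunking, one chunk of 128 columns
    chunks=(data.chunks[0], (NUM_CATEGORIES,)),
)
print(multihot.astype(int).compute())
```

This keeps everything lazy, so `data` never has to be materialized, but I don't know whether it is the idiomatic or most efficient approach.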