Multi Hot Encoding Dask Array

This is the same question I posted on StackOverflow

Right now I’m performing multi hot encoding in vanilla numpy but I’d like to port the code to dask.

import numpy as np

data = np.array([
    [1, 4, 77, 87, 100, 101, 102, 121],
    [12, 41, 58, 67, 81, 84, 96, 111],
    [31, 33, 35, 50, 60, 70, 92, 99],
])

multihot = np.eye(128, dtype=bool)[data]
multihot = np.logical_or.reduce(multihot, axis=1)

print(multihot.astype(int))

The previous block of code produces this output:

[[0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
  1. Is there a more efficient and dask compatible way to perform this type of conversion?
    (The number of categories is 128 and each row of data is ordered in ascending order)

  2. Can I port this procedure to dask? The problem with this block of code is that dask does not support slicing with lists in multiple axes. Moreover, dask.array.Array.vindex does not support indexing with dask objects: you first have to call compute (e.g. da.eye(128, dtype=bool).vindex[data.compute()]), but data is a huge dask array and does not fit in memory.

Hi @S1M0N38,

I don’t know if there is a Dask built-in solution for this, but since this seems to be embarrassingly parallel, you can always apply a NumPy transformation to every chunk of a Dask array using map_blocks.
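Here is a minimal sketch of what I mean, not a tested solution for your real data. The helper name multihot_block, the row chunk size of 2, and the rechunk along axis 1 are my own assumptions for illustration. The per-chunk function also fills a zero array with fancy-index assignment instead of building np.eye(128)[data], which avoids the intermediate 3-D array.

```python
import numpy as np
import dask.array as da

N_CATEGORIES = 128  # number of categories, taken from the question


def multihot_block(block, n_categories=N_CATEGORIES):
    # Hypothetical helper: multi-hot encode one in-memory NumPy chunk.
    out = np.zeros((block.shape[0], n_categories), dtype=bool)
    rows = np.arange(block.shape[0])[:, None]  # column vector, broadcasts against block
    out[rows, block] = True  # set every listed category for each row
    return out


raw = np.array([
    [1, 4, 77, 87, 100, 101, 102, 121],
    [12, 41, 58, 67, 81, 84, 96, 111],
    [31, 33, 35, 50, 60, 70, 92, 99],
])

# Assumption: chunk only along rows, so each block holds complete rows.
data = da.from_array(raw, chunks=(2, -1)).rechunk({1: -1})

# The output keeps the row chunking but has N_CATEGORIES columns,
# so the new chunk structure is passed explicitly.
multihot = data.map_blocks(
    multihot_block,
    dtype=bool,
    chunks=(data.chunks[0], (N_CATEGORIES,)),
)

result = multihot.compute()
print(result.astype(int))
```

Nothing is loaded into memory until compute is called, and each chunk is encoded independently, so the full index array never has to fit in memory at once.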

Hope that helps.