This is the same question I posted on StackOverflow.

Right now I'm performing multi-hot encoding in vanilla NumPy, but I'd like to port the code to Dask.
```python
import numpy as np

data = np.array([
    [1, 4, 77, 87, 100, 101, 102, 121],
    [12, 41, 58, 67, 81, 84, 96, 111],
    [31, 33, 35, 50, 60, 70, 92, 99],
])

# Index the identity matrix to get a one-hot vector per category,
# then OR the one-hot vectors of each row together
multihot = np.eye(128, dtype=bool)[data]
multihot = np.logical_or.reduce(multihot, axis=1)
print(multihot.astype(int))
```
The previous block of code produces this output:
```
[[0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
  0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
```
- Is there a more efficient and Dask-compatible way to perform this type of conversion? (The number of categories is 128, and each row of `data` is sorted in ascending order.)
- Can I port this procedure to Dask? The problem with this block of code is that Dask does not support slicing with lists on multiple axes. Moreover, `dask.array.Array.vindex` does not support indexing with Dask objects; you first have to call `compute` (e.g. `da.eye(128, dtype=bool).vindex[data.compute()]`), but `data` is a huge Dask array and does not fit in memory.
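For context, here is the kind of chunk-wise port I have in mind: a minimal, untested sketch that applies the NumPy logic per block with `da.map_blocks`, passing explicit output `chunks` because the block function changes the shape. It assumes `data` is not chunked along the category axis (axis 1); the tiny inline array stands in for the real data.

```python
import numpy as np
import dask.array as da

NUM_CATEGORIES = 128

def multihot_block(block):
    # block: (rows, k) chunk of category indices -> (rows, NUM_CATEGORIES) bool
    out = np.zeros((block.shape[0], NUM_CATEGORIES), dtype=bool)
    rows = np.repeat(np.arange(block.shape[0]), block.shape[1])
    out[rows, block.ravel()] = True
    return out

# Stand-in for the real (huge) array; chunked only along axis 0, so each
# block sees all of a row's category indices at once.
data = da.from_array(
    np.array([
        [1, 4, 77, 87, 100, 101, 102, 121],
        [12, 41, 58, 67, 81, 84, 96, 111],
        [31, 33, 35, 50, 60, 70, 92, 99],
    ]),
    chunks=(1, 8),
)

multihot = data.map_blocks(
    multihot_block,
    dtype=bool,
    # output chunk sizes: same row chunking, one chunk of 128 columns
    chunks=(data.chunks[0], (NUM_CATEGORIES,)),
)
print(multihot.astype(int).compute())
```

This keeps everything lazy, so `data` never has to be materialized, but I don't know whether it is the idiomatic or most efficient approach.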