I’m using LightGBM’s `DaskLGBMClassifier` to train and then predict on large data frames. Training and prediction both succeed (the predictions come out as a Dask Array), but I hit a wall when I want to filter the inference input `X` (a Dask DataFrame) to keep the records where the prediction is > 0. I get the error(s) below.
More generally, I want to generate an array of `len(X)` (e.g. `da.ones(len(X))`) and then use it as a boolean mask against `X`. How can I get this to work?

```python
X.loc[da.full(len(X), True)]
```

```
ValueError: The index and array have different numbers of blocks. (5 != 1)
```
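For a concrete picture, here’s a minimal sketch of the setup; the file and column names are made up, and it assumes a `dask.distributed` client is running (LightGBM’s Dask estimators require one):

```python
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client
from lightgbm import DaskLGBMClassifier

client = Client()                     # required by the Dask estimators

df = dd.read_csv("train.csv")         # hypothetical file; say 5 partitions
X, y = df.drop(columns="label"), df["label"]

clf = DaskLGBMClassifier().fit(X, y)
predictions = clf.predict(X)          # comes back as a Dask Array

# The failing filter: len(X) computes to a plain int, and da.full
# builds the mask as a single-chunk array, so its 1 block can't be
# matched against X's 5 partitions, hence "(5 != 1)".
mask = da.full(len(X), True)
X.loc[mask]
```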
I’ve tried several things to fix this.
- Repartition the predictions to the same number of partitions as the input data. The partition counts then match, but the per-partition lengths don’t (see the consolidated sketch after this list), and I get this error:

  ```
  ValueError: Length of values (336201) does not match length of index (331980)
  ```
- Try using `map_partitions` to align the data frame and the predictions series (`dd.from_dask_array(predictions)`). I get this error, which I don’t understand:

  ```
  ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index. If you don't want the partitions to be aligned, and are calling `map_partitions` directly, pass `align_dataframes=False`.
  ```
- ChatGPT told me to fix the unknown-divisions error with `reset_index`, which I ran on both the input `X` DataFrame and the predictions Series. That raises:

  ```
  IndexingError('Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).')
  ```
- I’ve also tried `dd.concat([X, y], axis=1)` and get the same error sequence.
- Tried `y = y.repartition(divisions=X.divisions)`:

  ```
  ValueError: right side of old and new divisions are different.
  ```
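For reference, here is a consolidated reconstruction of those attempts; `predictions` is the Dask Array returned by `predict`, and the lambda and variable names are mine (the comments quote the reported errors):

```python
import dask.dataframe as dd

y = dd.from_dask_array(predictions)

# 1. Same partition count, but per-partition lengths still differ:
#    ValueError: Length of values (...) does not match length of index (...)
X[y.repartition(npartitions=X.npartitions) > 0]

# 2. map_partitions refuses to align collections with unknown divisions:
#    ValueError: Not all divisions are known, can't align partitions.
X.map_partitions(lambda df, s: df[s.values > 0], y)

# 3. reset_index on both sides still leaves unalignable indexes:
#    IndexingError: Unalignable boolean Series provided as indexer
X.reset_index(drop=True)[y.reset_index(drop=True) > 0]

# 4. Concatenating hits the same error sequence:
dd.concat([X, y], axis=1)

# 5. Repartitioning by X's (unknown) divisions:
#    ValueError: right side of old and new divisions are different.
y.repartition(divisions=X.divisions)
```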
UPDATE AFTER FINDING A WORKAROUND:
Additional context: the `X` above was read from a CSV. `dd.read_csv` initializes the DataFrame without known divisions, and I found a post somewhere that said to set them by calling `reset_index().set_index("index")`, which creates a partitioned index over the CSV. This felt like a rabbit hole.
To simplify, the problem we were trying to solve was:
- read a CSV
- make a binary prediction on all the rows
- filter the rows based on the prediction
We predicted on all the rows at once and couldn’t map the predictions back to the CSV in order to filter it.
We instead used `y = X.map_partitions(model.predict)` to predict on each partition, which keeps the predictions aligned with `X`’s partitions (sketched below). Shoulda thought of this earlier.
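A sketch of that workaround; `model` stands for a trained local classifier (e.g. what `DaskLGBMClassifier.to_local()` returns), and wrapping the raw predictions in a `pd.Series` indexed like the partition is my addition to make the alignment explicit:

```python
import pandas as pd
import dask.dataframe as dd

X = dd.read_csv("data.csv")  # hypothetical path, unknown divisions

# Predict partition by partition: each partition is a plain pandas
# DataFrame, and indexing the result like the partition keeps every
# prediction attached to its row.
y = X.map_partitions(
    lambda part: pd.Series(model.predict(part), index=part.index),
    meta=("prediction", "int64"),
)

# y now shares X's partitioning and per-partition index, so boolean
# filtering works without any cross-partition alignment.
positives = X[y > 0]
```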
I’d still want to know whether a randomly generated boolean array can be used to filter a CSV-backed DataFrame using Dask.