Aligning LightGBM Dask Classifier predictions with input data

I’m using LightGBM’s DaskLGBMClassifier to train and then predict on large data frames. Training works and predictions come out fine (as a Dask Array). I hit a wall when I want to filter the inference input X (a Dask DataFrame) to keep only the records where the prediction is > 0: I get the errors below.
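Roughly what the setup looks like, as a schematic sketch with synthetic stand-in data (the real X is read from CSV and is much larger):

```python
import dask.dataframe as dd
import pandas as pd
from distributed import Client
from lightgbm import DaskLGBMClassifier

client = Client()  # lightgbm’s Dask estimators need a distributed client

# Synthetic stand-ins for the real data
X = dd.from_pandas(pd.DataFrame({"f0": range(1000), "f1": range(1000)}), npartitions=5)
y = dd.from_pandas(pd.Series([0, 1] * 500), npartitions=5)

clf = DaskLGBMClassifier(n_estimators=10)
clf.fit(X, y)

predictions = clf.predict(X)  # returns a Dask Array
```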

More generally, I want to generate an array of length len(X) (e.g. da.ones(len(X))) and then use it as a boolean mask against X. How can I get this to work? For example, X.loc[da.full(len(X), True)] raises:

ValueError: The index and array have different numbers of blocks. (5 != 1)
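For concreteness, a tiny repro of that general question (hypothetical data; da.full builds the mask as a single chunk, while this X has 5 partitions):

```python
import dask.array as da
import dask.dataframe as dd
import pandas as pd

# Hypothetical stand-in: a DataFrame with 5 partitions
X = dd.from_pandas(pd.DataFrame({"a": range(100)}), npartitions=5)

# The mask comes out as one chunk, so its blocks don’t line up
# with X’s 5 partitions
mask = da.full(len(X), True)
X.loc[mask]  # ValueError: ... different numbers of blocks. (5 != 1)
```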

I’ve tried several things to fix this.

  1. Repartition the predictions to have the same number of partitions as the input data, but the partition boundaries aren’t exactly the same. I get this error:
ValueError: Length of values (336201) does not match length of index (331980)
  2. Use map_partitions to align the data frame and the predictions series (dd.from_dask_array(predictions)). I get this error, which I don’t understand:
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.. If you don't want the partitions to be aligned, and are calling `map_partitions` directly, pass `align_dataframes=False`.
  3. ChatGPT told me to fix the unknown-divisions error with reset_index(), which I ran on both the input X DataFrame and the predictions Series. I get:
Exception: "IndexingError('Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).')"
  4. I’ve also tried dd.concat([X, y], axis=1) and get the same error sequence.
  5. Tried y = y.repartition(divisions=X.divisions):
ValueError: right side of old and new divisions are different.

UPDATE AFTER FINDING A WORKAROUND:

Additional context: X was read from a CSV, and dd.read_csv produces a DataFrame without known divisions. I found a post somewhere that said to set them by calling reset_index().set_index("index"), which creates a partitioned index over the CSV, but this felt like a rabbit hole.
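A quick sketch of that starting state (hypothetical path): read_csv can’t know which index values land in which partition, so the divisions come back unknown.

```python
import dask.dataframe as dd

X = dd.read_csv("data-*.csv")  # hypothetical path

print(X.known_divisions)  # False: read_csv can’t infer divisions
print(X.divisions)        # (None, None, ..., None)
```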

To simplify, the problem we were trying to solve was:

  • read a csv
  • make a binary prediction on all the rows
  • filter the rows based on the prediction

We predicted on all the rows at once and couldn’t map the predictions back to the CSV in order to filter it.

We instead used y = X.map_partitions(model.predict) to predict on each partition. Shoulda thought of this earlier.
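A slightly more explicit sketch of that workaround, assuming model is a local (non-Dask) LightGBM model (e.g. obtained via the Dask estimator’s to_local()); wrapping each partition’s predictions in a pandas Series keeps that partition’s index, so the result lines up with X:

```python
import pandas as pd

# `model` is assumed to be the local model, e.g. model = clf.to_local()
y = X.map_partitions(
    lambda part: pd.Series(model.predict(part), index=part.index),
    meta=("prediction", "int64"),
)

X_kept = X[y > 0]  # y shares X’s partitioning and index, so this works
```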

I’d still like to know whether a randomly generated boolean array can be used to filter a CSV-backed DataFrame using Dask.

Hi @edward-cates, welcome to the Dask community!

It seems you found a solution and I’m glad for it.

So if I understand correctly: in the first place, when you generate predictions on the full X input DataFrame, you obtain a prediction which is a Dask Array, but with a different partitioning than X?

Boolean indexing should work if your Array and DataFrame had the same length and block structure, which is obviously not the case here. Do you know why?

I don’t think you can use boolean indexing on a DataFrame with an Array or a DataFrame that is partitioned differently.

Hey! Yes, that’s right: the y_pred returned from LightGBM’s DaskLGBMClassifier has the same length but different partitioning than the input X_test, which I guess is an issue I should maybe report to LightGBM.

So out of curiosity: if I have a partitioned Dask DataFrame X and make a random array of the same length (da.random.randint(low=0, high=2, size=len(X))), there’s no way to use this array as a mask on X?

No, I don’t think so. You would need to have the same chunking, so the same number of partitions and the same partition sizes.

And even if you had these, from a quick look at the dask.array.random documentation, I don’t see an obvious way to hand that chunking to the various methods there.
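For reference, the dask.array.random functions do accept a chunks= argument, so a sketch of matching the mask’s chunks to X’s partition sizes might look like this (untested; computing the partition sizes is eager and can be expensive):

```python
import dask.array as da

# Eagerly compute each partition’s length (touches the data)
sizes = tuple(X.map_partitions(len).compute())

# Build the random mask with chunks that mirror X’s partitions
mask = da.random.randint(0, 2, size=int(sum(sizes)), chunks=(sizes,)).astype(bool)

X_filtered = X[mask]  # blocks now line up with X’s partitions
```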
