XGBRegressor vs DaskXGBRegressor prediction issue

Hello all!

I was looking to use DaskXGBRegressor since I'm familiar with DaskXGBClassifier and have used it with no issues whatsoever. However, when I predict with a trained DaskXGBRegressor model, ~80% of the predictions come back as NaN/null. When I compare against the output of a similarly set up XGBRegressor (non-Dask version), it makes predictions with no NaN/null values (same hyperparameters, same saved booster, same input data used for training and predicting).

Looking at the documentation (dask_ml.xgboost.XGBRegressor — dask-ml 2022.5.28 documentation), nothing jumps out at me. Does anyone have any tips or directions? Thank you in advance, I really appreciate it!

Sample code:
Dask Version:

import pandas as pd
import xgboost
import dask.dataframe as dd
from distributed import LocalCluster, Client

cluster = LocalCluster()
client = Client(cluster)

xgb_model_latest = xgboost.dask.DaskXGBRegressor()
xgb_model_latest.load_model('pretrained_model.json')

columns_used = pd.read_csv('DASK_features.csv')
columns_used = columns_used.iloc[:, 1]  # keep only the column of feature names

# X must be a Dask dataframe or array (loaded earlier)
X = X[columns_used.to_list()]  # make sure the structure of columns used matches X

xgb_model_latest.client = client  # attach the distributed client to the model

y_pred = xgb_model_latest.predict(X)
y_pred_regression = y_pred.to_frame(name='forecast')

Non-Dask Version:

import pandas as pd
import xgboost

columns_used = pd.read_csv('DASK_features.csv')
columns_used = columns_used.iloc[:, 1]  # keep only the column of feature names

# X and ddf are Dask dataframes loaded earlier
X = X[columns_used.to_list()].compute()  # materialize as a pandas DataFrame
hh_id = ddf.THD_HH_ID.compute()  # get hh_ids to merge predictions with

xgb_model_latest = xgboost.XGBRegressor()
xgb_model_latest.load_model('pretrained_model.json')  # ad hoc testing

y_pred = xgb_model_latest.predict(X)

Hi @boot329,

Unfortunately, nothing comes to mind regarding your issue. I've taken a look at Distributed XGBoost with Dask, but didn't find anything that could help you. I'm not sure the doc you're pointing to (dask-ml) is relevant or up to date, since you're using classes that live directly inside the xgboost package.

Do you think you would be able to replicate it with one of the simple examples using fake data provided in the XGBoost documentation?
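
If it helps, here is a minimal sketch along those lines, training and predicting entirely on random data; the shapes and n_estimators value are arbitrary placeholders, not taken from your setup:

import dask.array as da
import xgboost
from distributed import LocalCluster, Client

cluster = LocalCluster()
client = Client(cluster)

# Synthetic regression data: 1000 rows, 10 features, chunked for Dask
X = da.random.random((1000, 10), chunks=(100, 10))
y = da.random.random(1000, chunks=100)

model = xgboost.dask.DaskXGBRegressor(n_estimators=10)
model.client = client
model.fit(X, y)

y_pred = model.predict(X)
print(y_pred.compute())  # should contain no NaN values

If that toy run also produces NaN predictions, the problem is somewhere in the Dask setup; if it doesn't, that points at the saved model file or the input data.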

Ultimately, since you use an already trained model, you could also use the map_blocks or map_partitions functions on Dask collections to apply the predictions on regular Pandas or Numpy objects.
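
As a rough sketch of that approach, assuming X is the Dask DataFrame from your snippet and 'pretrained_model.json' is your saved booster:

import pandas as pd
import xgboost

def predict_partition(part: pd.DataFrame) -> pd.Series:
    # Load the booster inside the task so each worker gets its own copy
    model = xgboost.XGBRegressor()
    model.load_model('pretrained_model.json')
    return pd.Series(model.predict(part), index=part.index, name='forecast')

# Each partition of X is handed to the plain (non-Dask) predict
# as a regular pandas DataFrame
y_pred = X.map_partitions(predict_partition, meta=('forecast', 'f4'))

Loading the booster inside the function keeps it out of the task graph; each worker simply rebuilds it from the JSON file when it processes a partition.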