XGBRegressor vs DaskXGBRegressor prediction issue

Hello all!

I was looking to use DaskXGBRegressor since I'm familiar with DaskXGBClassifier and have used it with no issues whatsoever. However, when I predict with a trained DaskXGBRegressor model, ~80% of the predictions come back as NaN/null. When I compare against the output of a similarly set up XGBRegressor (non-Dask version), it makes predictions with no NaN/null values (same hyperparameters, same saved booster, same input data used for training and predicting).

Looking at the documentation (dask_ml.xgboost.XGBRegressor — dask-ml 2022.5.28 documentation), nothing jumps out at me. Does anyone have any tips or directions? Thank you in advance, I really appreciate it!

Sample code:
Dask Version:

import pandas as pd
import xgboost
import dask.dataframe as dd
from distributed import LocalCluster, Client

cluster = LocalCluster()
client = Client(cluster)

xgb_model_latest = xgboost.dask.DaskXGBRegressor()
xgb_model_latest.load_model('pretrained_model.json')

columns_used = pd.read_csv('DASK_features.csv')
columns_used = columns_used.iloc[:, 1]  # keep only the column of feature names

# X must be a Dask dataframe or array (loaded earlier)
X = X[columns_used.to_list()]  # make sure the structure of columns used matches X

xgb_model_latest.client = client  # attach the distributed client to the model

y_pred = xgb_model_latest.predict(X)
y_pred_regression = y_pred.to_frame(name='forecast')

Non-Dask Version:

import pandas as pd
import xgboost

columns_used = pd.read_csv('DASK_features.csv')
columns_used = columns_used.iloc[:, 1]  # keep only the column of feature names

# X and ddf are Dask dataframes loaded earlier
X = X[columns_used.to_list()].compute()  # materialize as a pandas DataFrame
hh_id = ddf.THD_HH_ID.compute()  # get hh_ids to merge predictions with

xgb_model_latest = xgboost.XGBRegressor()
xgb_model_latest.load_model('pretrained_model.json')  # ad hoc testing

y_pred = xgb_model_latest.predict(X)

Hi @boot329,

Unfortunately, nothing comes to mind regarding your issue. I've taken a look at Distributed XGBoost with Dask, but didn't find anything that could help you. I'm not sure the doc you're pointing to (dask-ml) is relevant or up to date, since you're using classes that live directly inside the xgboost package.

Do you think you would be able to replicate it with one of the simple examples using fake data provided in the XGBoost documentation?
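
If it helps, here is a minimal sketch along those lines, training and predicting entirely on random data; the shapes and n_estimators value are arbitrary placeholders, not taken from your setup:

import dask.array as da
import xgboost
from distributed import LocalCluster, Client

cluster = LocalCluster()
client = Client(cluster)

# Synthetic regression data: 1000 rows, 10 features, chunked for Dask
X = da.random.random((1000, 10), chunks=(100, 10))
y = da.random.random(1000, chunks=100)

model = xgboost.dask.DaskXGBRegressor(n_estimators=10)
model.client = client
model.fit(X, y)

y_pred = model.predict(X)
print(y_pred.compute())  # should contain no NaN values

If that toy run also produces NaN predictions, the problem is somewhere in the Dask setup; if it doesn't, that points at the saved model file or the input data.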

Ultimately, since you use an already trained model, you could also use the map_blocks or map_partitions functions on Dask collections to apply the predictions on regular Pandas or Numpy objects.
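
As a rough sketch of that approach, assuming X is the Dask DataFrame from your snippet and 'pretrained_model.json' is your saved booster:

import pandas as pd
import xgboost

def predict_partition(part: pd.DataFrame) -> pd.Series:
    # Load the booster inside the task so each worker gets its own copy
    model = xgboost.XGBRegressor()
    model.load_model('pretrained_model.json')
    return pd.Series(model.predict(part), index=part.index, name='forecast')

# Each partition of X is handed to the plain (non-Dask) predict
# as a regular pandas DataFrame
y_pred = X.map_partitions(predict_partition, meta=('forecast', 'f4'))

Loading the booster inside the function keeps it out of the task graph; each worker simply rebuilds it from the JSON file when it processes a partition.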