I have been using this code to train, persist, and retrain a model with XGBoost. Since the predictions are multi-label, I am using MultiOutputClassifier from sklearn.multioutput:
import os
import joblib
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier

if os.path.isfile(model_file_name):
    print(f'loading model from file {model_file_name}')
    model = joblib.load(model_file_name)
else:
    print(f'No model file found. Creating new model {model_file_name}')
    # each label is an independent binary problem, so use a binary objective
    model = XGBClassifier(objective='binary:logistic',
                          eval_metric='logloss')
    model = MultiOutputClassifier(model)

print('started training')
model.fit(train, labels)

print('saving model')
joblib.dump(model, model_file_name)
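For reference, prediction with the wrapped model looks like this, where test is a hypothetical held-out set with the same columns as train:

preds = model.predict(test)        # shape (n_samples, n_labels), one binary column per label
proba = model.predict_proba(test)  # list of per-label probability arrays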
Now that training is taking a while, I would like to take advantage of Dask distributed to run the training and predictions on multiple machines if possible. The main problems I am facing are:
- An alternative for MultiOutputClassifier in Dask (see the sketch after this list).
- The ability to save a model to, and load it from, a local file.
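To make the first point concrete, this is a minimal sketch of what I would like to write; I do not know whether sklearn's MultiOutputClassifier can drive a Dask-aware estimator at all (client, train, and labels are assumed to already exist as a dask.distributed.Client and Dask collections):

import xgboost as xgb
from sklearn.multioutput import MultiOutputClassifier

# wishful thinking: wrap the Dask estimator the same way as the plain XGBClassifier
clf = xgb.dask.DaskXGBClassifier(objective='binary:logistic', tree_method='hist')
clf.client = client
model = MultiOutputClassifier(clf)
model.fit(train, labels)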
I was able to find the following snippet in the XGBoost docs for the sklearn interface, but nothing related to multi-output:
import xgboost as xgb

# client is an existing dask.distributed.Client; X and y are Dask collections
clf = xgb.dask.DaskXGBClassifier(n_estimators=100, tree_method="hist")
clf.client = client  # assign the client
clf.fit(X, y, eval_set=[(X, y)])
proba = clf.predict_proba(X)
Assuming the resulting clf behaves like an sklearn model, saving and loading should be transparent. The remaining question is how to get a MultiOutput classifier under Dask.
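On the second point, this is the persistence sketch I have in mind, assuming DaskXGBClassifier supports the standard save_model/load_model methods of the XGBoost sklearn interface ('model.json' is a placeholder path):

clf.save_model('model.json')  # assumption: the sklearn-interface persistence API applies

clf2 = xgb.dask.DaskXGBClassifier()
clf2.load_model('model.json')
clf2.client = client  # the client is runtime state, presumably not part of the saved file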
Thank you in advance.