Save and load XGBoost model with multi-label output

I have been using this code to train, persist, and retrain a model using XGBoost. Since the predictions are multi-label, I am using MultiOutputClassifier from sklearn.multioutput:

    if os.path.isfile(model_file_name):
        print(f'loading model from file {model_file_name}')
        model = joblib.load(model_file_name)
    else:
        print(f'No model file found. creating new model {model_file_name}')
        objective = 'binary:logistic'
        model = XGBClassifier(objective=objective)
        model = MultiOutputClassifier(model)

    print('started training')
    model.fit(features, labels)
    print('saving model')
    joblib.dump(model, model_file_name)

Now that the training is taking a while, I would like to take advantage of Dask distributed to run the training and predictions on multiple machines if possible. The main problems I am facing are:

  • Alternative for MultiOutputClassifier in Dask.
  • The ability to save and load a model into and from a local file.

I was able to find the following snippet in the XGBoost docs for the sklearn interface, but nothing related to multi-output:

    clf = xgb.dask.DaskXGBClassifier(n_estimators=100, tree_method="hist")
    clf.client = client  # assign the client
    clf.fit(X, y, eval_set=[(X, y)])
    proba = clf.predict_proba(X)

Assuming the resulting clf is an sklearn model, saving and loading should be transparent. The remaining question is: how do I get a MultiOutput classifier under Dask?

Thank you in advance

Thanks for the question @malakeel!

You can use joblib with MultiOutputClassifier since it has an n_jobs parameter. The usage should be quite similar to this example, except you'll swap out RandomizedSearchCV for your classifier. Happy to work with you on the exact syntax if you'd like to share a minimal reproducer!


Thanks a lot, Sarah. I think I found my way around it. I had some version-mismatch issues with the other machine, but it worked fine on a single worker.