I have been using this code to train, persist, and retrain a model with XGBoost. Since the predictions are multi-label, I am using MultiOutputClassifier from sklearn.multioutput:
import os
import joblib
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier

if os.path.isfile(model_file_name):
    print(f'loading model from file {model_file_name}')
    model = joblib.load(model_file_name)
else:
    print(f'No model file found. Creating new model {model_file_name}')
    # each label is an independent binary problem, so use a binary objective
    model = XGBClassifier(objective='binary:logistic',
                          eval_metric='logloss')
    model = MultiOutputClassifier(model)

print('started training')
model.fit(train, labels)

print('saving model')
joblib.dump(model, model_file_name)
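For reference, prediction with the wrapped model looks like this, where test is a hypothetical held-out set with the same columns as train:

preds = model.predict(test)        # shape (n_samples, n_labels), one binary column per label
proba = model.predict_proba(test)  # list of per-label probability arrays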
Now that training is taking a while, I would like to take advantage of Dask distributed to run the training and predictions on multiple machines if possible. The main problems I am facing are:
- An alternative for MultiOutputClassifier in Dask (see the sketch after this list).
- The ability to save a model to, and load it from, a local file.
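To make the first point concrete, this is a minimal sketch of what I would like to write; I do not know whether sklearn's MultiOutputClassifier can drive a Dask-aware estimator at all (client, train, and labels are assumed to already exist as a dask.distributed.Client and Dask collections):

import xgboost as xgb
from sklearn.multioutput import MultiOutputClassifier

# wishful thinking: wrap the Dask estimator the same way as the plain XGBClassifier
clf = xgb.dask.DaskXGBClassifier(objective='binary:logistic', tree_method='hist')
clf.client = client
model = MultiOutputClassifier(clf)
model.fit(train, labels)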
I was able to find the following snippet in the XGBoost docs for the sklearn interface, but nothing related to multi-output:
import xgboost as xgb

# client is an existing dask.distributed.Client; X and y are Dask collections
clf = xgb.dask.DaskXGBClassifier(n_estimators=100, tree_method="hist")
clf.client = client  # assign the client
clf.fit(X, y, eval_set=[(X, y)])
proba = clf.predict_proba(X)
Assuming the resulting clf behaves like an sklearn model, saving and loading should be transparent. The remaining question is how to get a MultiOutput classifier under Dask.
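On the second point, this is the persistence sketch I have in mind, assuming DaskXGBClassifier supports the standard save_model/load_model methods of the XGBoost sklearn interface ('model.json' is a placeholder path):

clf.save_model('model.json')  # assumption: the sklearn-interface persistence API applies

clf2 = xgb.dask.DaskXGBClassifier()
clf2.load_model('model.json')
clf2.client = client  # the client is runtime state, presumably not part of the saved file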
Thank you in advance.