Hey everyone, I’m trying to implement a federated learning simulation with XGBoost. I don’t need any privacy measures, because I’m only simulating federated learning. I was wondering if I could use Dask XGBoost (Distributed XGBoost with Dask — xgboost 1.6.2 documentation) for this.
Essentially, I want to train a single model on the combined data from the different simulated local servers, without any local server ever seeing another server’s data. From what I understand, Dask’s distributed computation doesn’t share raw data between nodes; the only statistics communicated between workers are the gradients and hessians used to build the trees. I wanted to see if the same principle could be used for federated learning.
I have split a large dataset into 8 local datasets within my script, and I want to build a federated XGBoost model from these 8 datasets. I assume that with distributed learning the data itself isn’t shared between the simulated local ‘servers’. Is there some code showing how I can implement this in a Python script?
Here’s how I currently have the data set up in my Python script.
from sklearn.model_selection import train_test_split

# dictionaries to hold the per-hospital training and testing splits
X_training_dictionary = {}
X_testing_dictionary = {}
y_training_dictionary = {}
y_testing_dictionary = {}

# each hospital is treated as a simulated local server
for hosp in data_dictionary:
    # data_dictionary was initialized earlier and contains the complete data
    # for each local hospital as an array, with the label in the last column
    data = data_dictionary[hosp]
    X = data[:, :-1]
    y = data[:, -1]
    # rs is a fixed random seed defined earlier in the script
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.20, random_state=rs
    )
    X_training_dictionary[hosp] = X_train
    X_testing_dictionary[hosp] = X_test
    y_training_dictionary[hosp] = y_train
    y_testing_dictionary[hosp] = y_test
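For reference, here is a rough, untested sketch of what I imagine the Dask side would look like, using the dictionaries from the setup above and assuming a binary classification objective (the objective and num_boost_round are placeholders). It spins up one local worker per hospital, turns each hospital’s training set into its own Dask chunk, and trains one model over the distributed data. What I’m not sure about is whether Dask actually keeps each hospital’s chunk pinned to its own worker, or whether it might move rows around, which is really the core of my question.

import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

# One worker per simulated hospital; in a real federated deployment these
# would be separate machines, here they are just local processes.
cluster = LocalCluster(n_workers=len(X_training_dictionary), threads_per_worker=1)
client = Client(cluster)

# Turn each hospital's training set into a single Dask chunk, so the combined
# arrays end up with one chunk per hospital.
X_parts = [da.from_array(X_training_dictionary[h], chunks=X_training_dictionary[h].shape)
           for h in X_training_dictionary]
y_parts = [da.from_array(y_training_dictionary[h], chunks=y_training_dictionary[h].shape)
           for h in y_training_dictionary]
X_all = da.concatenate(X_parts, axis=0)
y_all = da.concatenate(y_parts, axis=0)

# DaskDMatrix keeps the data distributed across the workers; xgb.dask.train
# then runs histogram-based training where workers exchange gradient/hessian
# summaries rather than raw rows.
dtrain = xgb.dask.DaskDMatrix(client, X_all, y_all)
output = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]  # the single combined ("federated") model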
Thanks!