Hello. I’m trying to train a scikit-learn model in batches.
- I split the data into train and test sets, but I only split the indexes, not the actual data, because it’s big (roughly as in the sketch after this list)
- In a loop, I fetch the data for the train indexes in batches
- Every iteration, I preprocess the batch data and vectorize it
- Then I use sklearn’s `partial_fit` to train the model
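
For reference, the index-only split happens in the same class as the training loop below and looks roughly like this (a simplified sketch, not the exact code; the split ratio and call are just placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Simplified sketch: only the index values are split, the dask dataframe
# itself stays lazy and is never materialized here.
all_idx = np.array(self.data.index.compute())
self.train_idx, self.test_idx = train_test_split(
    all_idx, test_size=0.2, random_state=42)
```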
But there seems to be a memory leak in the training loop, because memory usage never stops growing. Any tips would be appreciated.
This is the loop I use. What could be wrong? `self.data` is a dask dataframe and `self.train_idx` holds the indexes of the training samples.
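
For completeness, the vectorizer, model, and test set that the loop refers to are created once before the loop. A simplified sketch (`HashingVectorizer` and `SGDClassifier` are stand-ins for the actual classes I use):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stand-ins for the actual vectorizer and model: HashingVectorizer is
# stateless (transform needs no fit), SGDClassifier supports partial_fit.
self.vectorizer = HashingVectorizer(n_features=2**18)
self.model = SGDClassifier(loss="log_loss")

# The held-out test set is materialized once and reused for the
# per-batch score inside the loop.
test_df = self.data.map_partitions(
    lambda df: df[df.index.isin(self.test_idx)]).compute()
x_test = self.vectorizer.transform(
    TextDataPreprocessor().transform(test_df.text, preprocess_params))
self.y_test = test_df.label
```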
```python
# train model
logger.info("Training model...")
for i in range(0, len(self.train_idx), batch_size):
    # get batch data
    batch = self.data.map_partitions(
        lambda df: df[df.index.isin(self.train_idx[i:i + batch_size])])
    batch = batch.sample(frac=1, random_state=42)
    # preprocess batch data
    batch.text = batch.map_partitions(
        lambda df: TextDataPreprocessor().transform(df.text, preprocess_params))
    # vectorize batch data
    x_train = batch.map_partitions(
        lambda df: self.vectorizer.transform(df.text)).compute()
    self.model.partial_fit(x_train, batch.label, classes=classes)
    preds = self.model.predict(x_test)
    batch_number = (i // batch_size) + 1
    logger.info(
        f"Test score for batch {batch_number}: "
        + f"{sklearn.metrics.f1_score(self.y_test, preds, average='weighted')}")
```