Memory leak in a loop

Hello. I’m trying to train a scikit-learn model using batches.

  1. I splitted data into train and test set but only splitted the indexes; not the actual data because it’s big
  2. In a loop, I get data from train indexes in a batched manner
  3. Every iteration, I preprocess the data and vectorize it
  4. Then I use sklearn’s partial_fit to train the model

But there seems to be a memory leak in this loop because the memory usage doesn’t stop growing. Any tip would be appreciated

This is the loop I use. What could be wrong?
self.data is a dask dataframe. self.train_idx is the indexes for the train samples.

        # train model
        logger.info("Training model...")
        for i in range(0, len(self.train_idx), batch_size):
            # get batch data
            batch = self.data.map_partitions(lambda df: df[df.index.isin(self.train_idx[i:i + batch_size])])

            batch = batch.sample(frac=1, random_state=42)

            # preprocess batch data
            batch.text = batch.map_partitions(lambda df: TextDataPreprocessor().transform(df.text, preprocess_params))

            # vectorize batch data
            x_train = batch.map_partitions(lambda df: self.vectorizer.transform(df.text)).compute()

            self.model.partial_fit(x_train, batch.label, classes=classes)

            preds = self.model.predict(x_test)

            batch_number = (i // batch_size) + 1
            logger.info(
                f"Test score for batch {batch_number}: "
                + f"{sklearn.metrics.f1_score(self.y_test, preds, average='weighted')}")

Hi @daskee, welcome to Dask Discourse forum!

At a first glance, I’m wondering why you are using your own for loop inside your master process instead of using dask-ml or ScikitLearn with Dask backend?

See Incremental Learning — dask-ml 2022.5.28 documentation.