Hello. I’m trying to train a scikit-learn model in batches.
- I split the data into train and test sets, but I only split the indexes, not the actual data, because it’s big (roughly as in the sketch after this list)
- In a loop, I fetch the data for the train indexes in batches
- Every iteration, I preprocess the batch data and vectorize it
- Then I use sklearn’s `partial_fit` to train the model
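
For reference, the index-only split happens in the same class as the training loop below and looks roughly like this (a simplified sketch, not the exact code; the split ratio and call are just placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Simplified sketch: only the index values are split, the dask dataframe
# itself stays lazy and is never materialized here.
all_idx = np.array(self.data.index.compute())
self.train_idx, self.test_idx = train_test_split(
    all_idx, test_size=0.2, random_state=42)
```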
But there seems to be a memory leak in the training loop, because memory usage never stops growing. Any tips would be appreciated.
This is the loop I use. What could be wrong? `self.data` is a dask dataframe and `self.train_idx` holds the indexes of the training samples.
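
For completeness, the vectorizer, model, and test set that the loop refers to are created once before the loop. A simplified sketch (`HashingVectorizer` and `SGDClassifier` are stand-ins for the actual classes I use):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stand-ins for the actual vectorizer and model: HashingVectorizer is
# stateless (transform needs no fit), SGDClassifier supports partial_fit.
self.vectorizer = HashingVectorizer(n_features=2**18)
self.model = SGDClassifier(loss="log_loss")

# The held-out test set is materialized once and reused for the
# per-batch score inside the loop.
test_df = self.data.map_partitions(
    lambda df: df[df.index.isin(self.test_idx)]).compute()
x_test = self.vectorizer.transform(
    TextDataPreprocessor().transform(test_df.text, preprocess_params))
self.y_test = test_df.label
```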
```python
# train model
logger.info("Training model...")
for i in range(0, len(self.train_idx), batch_size):
    # get batch data
    batch = self.data.map_partitions(
        lambda df: df[df.index.isin(self.train_idx[i:i + batch_size])])
    batch = batch.sample(frac=1, random_state=42)
    # preprocess batch data
    batch.text = batch.map_partitions(
        lambda df: TextDataPreprocessor().transform(df.text, preprocess_params))
    # vectorize batch data
    x_train = batch.map_partitions(
        lambda df: self.vectorizer.transform(df.text)).compute()
    self.model.partial_fit(x_train, batch.label, classes=classes)
    preds = self.model.predict(x_test)
    batch_number = (i // batch_size) + 1
    logger.info(
        f"Test score for batch {batch_number}: "
        + f"{sklearn.metrics.f1_score(self.y_test, preds, average='weighted')}")
```