Out of memory when using sklearn to compute score

When I used sklearn to train a random forest:

with joblib.parallel_backend('dask'):
    rf.fit(xbb_drop_dup_onehot, pseudo_label)

it worked fine. But when I used sklearn to compute the score on the same data, my memory ran out and my kernel was killed:

with joblib.parallel_backend('dask'):
    score = rf.score(xbb_drop_dup_onehot, pseudo_label)

Hi @Wesady, welcome to the Dask community!

How big is your input dataset? I agree, it is really weird to have the score function failing like that. Do you need the joblib context manager to compute it?

How much memory do you have on your machine?

You could also try using a LocalCluster and monitor the progress of the call on the dashboard, but I'm still not sure the score function benefits from joblib.
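
For instance, a minimal sketch of that setup (the n_workers and memory_limit values are only assumptions to adapt to your machine; rf, xbb_drop_dup_onehot and pseudo_label are your own objects):

from dask.distributed import Client, LocalCluster
import joblib

# Assumed local setup; adjust n_workers / memory_limit to your machine
cluster = LocalCluster(n_workers=4, memory_limit='8GB')
client = Client(cluster)
print(client.dashboard_link)  # open this URL to watch memory and task activity

with joblib.parallel_backend('dask'):
    rf.fit(xbb_drop_dup_onehot, pseudo_label)
    score = rf.score(xbb_drop_dup_onehot, pseudo_label)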

Thank you for your reply!
My variable xbb_drop_dup_onehot has shape (40750, 667) and all features have been one-hot encoded. My cluster has 800 GB of memory, and I found it was exhausted when I computed the score. While fitting the random forest, the dashboard shows very little memory in use; but when I compute the score, the workers' memory is not used at all, while htop shows the memory on my machine being used up.

So the problem is on your Client machine? How much memory do you have there?

There is 800 GB. And the problem is that, based on my observation above, I feel the Dask cluster is not being used when computing the score.

I think we need more details on your overall implementation. It would be really helpful to see your complete workflow, or better yet, a reproducer.

I’m really not sure that the score method of sklearn can be natively parallelized using joblib.

What type is the xbb_drop_dup_onehot variable?

Oh, maybe the score method of sklearn cannot be natively parallelized using joblib. xbb_drop_dup_onehot is just a one-hot encoded matrix.

Well, I have no clue here; without a reproducer or a more complete workflow it's hard to tell.

The xbb_drop_dup_onehot variable is not that big, and your model probably isn't either. Computing the score on a single machine should be pretty fast.

Did you try without using Dask at all?
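
Something like this, for example (just a sketch, assuming rf is a RandomForestClassifier; n_jobs=-1 is an assumption, adapt it to your actual estimator and machine):

from sklearn.ensemble import RandomForestClassifier

# Plain scikit-learn, no joblib/Dask backend; the individual trees are
# still fit and evaluated in parallel through n_jobs
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(xbb_drop_dup_onehot, pseudo_label)
score = rf.score(xbb_drop_dup_onehot, pseudo_label)
print(score)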