Out of memory when using sklearn to compute score

When I used sklearn to train a random forest:

with joblib.parallel_backend('dask'):
    rf.fit(xbb_drop_dup_onehot, pseudo_label)

it worked fine. But when I used sklearn to compute the score on the same data, my memory ran out and my kernel was killed:

with joblib.parallel_backend('dask'):
    score = rf.score(xbb_drop_dup_onehot, pseudo_label)

Hi @Wesady, welcome to the Dask community!

How big is your input dataset? I agree, it is really weird to have the score function failing like that. Do you need the joblib context manager to compute it?

How much memory do you have on your machine?

You could also try using a LocalCluster and monitor the progress of the call on the dashboard, but I'm still not sure the score function benefits from joblib.
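
For instance, a minimal sketch of that setup (the n_workers and memory_limit values are only assumptions to adapt to your machine; rf, xbb_drop_dup_onehot and pseudo_label are your own objects):

from dask.distributed import Client, LocalCluster
import joblib

# Assumed local setup; adjust n_workers / memory_limit to your machine
cluster = LocalCluster(n_workers=4, memory_limit='8GB')
client = Client(cluster)
print(client.dashboard_link)  # open this URL to watch memory and task activity

with joblib.parallel_backend('dask'):
    rf.fit(xbb_drop_dup_onehot, pseudo_label)
    score = rf.score(xbb_drop_dup_onehot, pseudo_label)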

Thank you for your reply!
My variable xbb_drop_dup_onehot has shape (40750, 667) and all features have been one-hot encoded. My cluster has 800 GB of memory, and I found it was exhausted when I computed the score. While fitting the random forest, the dashboard shows very little memory in use; but when I compute the score, the workers' memory is not used at all, while htop shows the memory on my machine being used up.

So the problem is on your Client machine? How much memory do you have there?

There is 800 GB. And the problem is that, based on my observation above, I feel the Dask cluster is not being used when computing the score.

I think we need more details on your overall implementation. It would be really helpful to see your complete workflow, or better yet, a reproducer.

I’m really not sure that the score method of sklearn can be natively parallelized using joblib.

What type is the xbb_drop_dup_onehot variable?

Oh, maybe the score method of sklearn cannot be natively parallelized using joblib. xbb_drop_dup_onehot is just a one-hot encoded matrix.

Well, I have no clue here; without a reproducer or a more complete workflow it's hard to tell.

The xbb_drop_dup_onehot variable is not that big, and your model probably isn't either. Computing the score on a single machine should be pretty fast.

Did you try without using Dask at all?
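
Something like this, for example (just a sketch, assuming rf is a RandomForestClassifier; n_jobs=-1 is an assumption, adapt it to your actual estimator and machine):

from sklearn.ensemble import RandomForestClassifier

# Plain scikit-learn, no joblib/Dask backend; the individual trees are
# still fit and evaluated in parallel through n_jobs
rf = RandomForestClassifier(n_jobs=-1)
rf.fit(xbb_drop_dup_onehot, pseudo_label)
score = rf.score(xbb_drop_dup_onehot, pseudo_label)
print(score)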