Calculating Area Under the Curve (AUC) using dask

Kaegan · October 31, 2022, 3:09pm

Hello Dask Community,

I was wondering if anyone knows of an implementation for calculating AUC using dask. I have the predicted probabilities on a test set (a distributed dask dataframe or array) computed from a logistic regression using dask-ml, and now I would like to calculate AUC. I figured it was worth asking whether anyone has done this before, or what challenges you might see in writing this myself.

Thank you in advance for the help!

niu · November 3, 2022, 2:43pm

After dask-ml, your results should be quite small and manageable. You can just switch to pandas and other regular packages to calculate AUC.

Kaegan · November 3, 2022, 7:51pm

Hi Niu,

Thank you so much for your response! If my test dataset is small enough, this approach would be fine, but I am wondering whether it could work when my test dataset is too large to fit into memory. One thing I have thought about is calculating the ROC curve using dask arrays, then taking the results from the curve, which could be smaller and likely fit into memory, and using sklearn to calculate AUC from there. Is that kind of what you meant?
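One way to picture the "reduce first, then compute AUC in memory" idea is a binned approximation: accumulate per-class score histograms chunk by chunk, then compute AUC from the two small histograms. The sketch below is only an illustration under stated assumptions: `binned_auc` is a hypothetical helper, not part of dask or dask-ml, the chunks are plain NumPy arrays standing in for dask partitions, labels are assumed to be 0/1, and scores are assumed to lie in [0, 1]. The result is approximate, because scores that fall into the same bin are treated as ties.

```python
import numpy as np

def binned_auc(chunks, n_bins=1000):
    """Approximate AUC from an iterable of (y_true, scores) chunks.

    Hypothetical helper: assumes 0/1 labels and scores in [0, 1].
    Only two histograms of length n_bins are kept in memory.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    pos = np.zeros(n_bins)  # histogram of scores for positive rows
    neg = np.zeros(n_bins)  # histogram of scores for negative rows
    for y_true, scores in chunks:
        y_true = np.asarray(y_true)
        scores = np.asarray(scores, dtype=float)
        pos += np.histogram(scores[y_true == 1], bins=edges)[0]
        neg += np.histogram(scores[y_true == 0], bins=edges)[0]
    # For each bin: number of negatives in strictly lower bins.
    neg_below = np.cumsum(neg) - neg
    # A positive beats every negative in a lower bin and counts as
    # half a win against negatives in the same bin (treated as ties).
    wins = np.sum(pos * (neg_below + 0.5 * neg))
    return wins / (pos.sum() * neg.sum())

# Two small chunks standing in for partitions of a larger dataset.
chunks = [
    ([0, 0], [0.1, 0.4]),
    ([1, 1], [0.35, 0.8]),
]
print(binned_auc(chunks, n_bins=100))
```

More bins give a closer approximation at the cost of slightly larger histograms; the memory used does not depend on the number of rows.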

Thanks again!

niu · November 3, 2022, 10:15pm

Calculating the ROC curve involves sorting (python - How to read this ROC curve and set custom thresholds? - Stack Overflow). I don’t think that’s something Dask (or any distributed structure) is good at. Basically, after prediction, you probably have y_predicted_probability (float), y_predicted_label (small int), y_actual_label (small int), and maybe something more that I forgot. You don’t need any other columns. So fitting these few columns into a Pandas dataframe and doing the calculation in sklearn should be fine for most cases. It’s hard to imagine this wouldn’t work even if the data is really, really large. If it somehow still doesn’t fit, I don’t think you should plot the ROC curve.

Kaegan · November 8, 2022, 2:21pm

Hi Niu,

Sorry it’s been a few days; I was caught up with work. I appreciate your response! It sounds like I was overthinking the problem and assuming everything had to be done in dask and be distributed. What you have said sounds reasonable to me.

Thank you for your help!

Kaegan · January 11, 2023, 7:30pm

For anyone else wondering the same thing: I forgot to revisit this, but this approach has worked for me even with a dataset of roughly 9 million rows. Marking as answered now, thanks again for the help @niu!!!
