Calculating Area Under the Curve (AUC) using dask

Hello Dask Community,

I was wondering if anyone knows of an implementation for calculating AUC using dask. I have the predicted probabilities on a test set (a distributed dask dataframe or array), computed from a logistic regression using dask-ml, and now I would like to calculate AUC. I figured it was worth asking whether anyone has done this before, or what challenges you might see in writing this myself.

Thank you in advance for the help!

After dask-ml, your results should be quite small and manageable. You can just switch to pandas and other regular packages to calculate AUC.

Hi Niu,

Thank you so much for your response! If my test dataset is small enough, this approach would be fine, but I am wondering whether it could work when my test dataset is too large to fit into memory. One thing I have thought about is calculating the ROC curve using dask arrays, then taking the points of the curve, which should be much smaller and likely fit into memory, and using sklearn to calculate AUC from there. Is that kind of what you meant?
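To make the idea concrete, here is a rough sketch of one way it could look. It is not an existing dask-ml function, just an assumption of how the pieces might fit: the scores are bucketed into fixed bins with `dask.array.histogram` (once weighted by the labels, once by their complement), only the small per-bin counts are brought into memory, and `sklearn.metrics.auc` integrates the resulting curve. The helper name `binned_roc_auc`, the `n_bins` parameter, and the synthetic data are all made up for illustration, and the result is an approximation because scores falling in the same bin are treated as ties.

```python
import dask
import dask.array as da
import numpy as np
from sklearn.metrics import auc, roc_auc_score


def binned_roc_auc(y_true, y_score, n_bins=1000):
    """Approximate ROC curve and AUC from per-bin counts (hypothetical helper).

    Assumes y_true holds 0/1 labels, y_score holds probabilities in [0, 1],
    both are dask arrays with identical chunks, and both classes are present.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Weighting by the label counts positives per bin; 1 - label counts negatives.
    pos_hist, _ = da.histogram(y_score, bins=edges, weights=y_true)
    neg_hist, _ = da.histogram(y_score, bins=edges, weights=1 - y_true)
    # Only two arrays of length n_bins ever reach local memory.
    pos, neg = dask.compute(pos_hist, neg_hist)
    # Sweep the threshold from the highest bin down to the lowest.
    tp = np.concatenate([[0], np.cumsum(pos[::-1])])
    fp = np.concatenate([[0], np.cumsum(neg[::-1])])
    tpr = tp / tp[-1]
    fpr = fp / fp[-1]
    return fpr, tpr, auc(fpr, tpr)


# Synthetic stand-in for the real predictions.
rng = np.random.default_rng(0)
labels_np = rng.integers(0, 2, size=200_000)
scores_np = np.clip(0.3 * labels_np + rng.normal(0.4, 0.2, size=labels_np.size), 0.0, 1.0)
y_true = da.from_array(labels_np, chunks=20_000)
y_score = da.from_array(scores_np, chunks=20_000)

fpr, tpr, approx_auc = binned_roc_auc(y_true, y_score)
exact_auc = roc_auc_score(labels_np, scores_np)  # in-memory reference for comparison
print(f"binned AUC: {approx_auc:.4f}, exact AUC: {exact_auc:.4f}")
```

More bins tighten the approximation at the cost of a slightly larger result, which stays tiny compared with the test set either way.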

Thanks again!

Calculating the ROC curve involves sorting (see “How to read this ROC curve and set custom thresholds?” on Stack Overflow). I don’t think that’s something Dask (or any distributed structure) is good at. Basically, after prediction, you probably have y_predicted_probability (float), y_predicted_label (small int), y_actual_label (small int), and maybe something more that I forgot. You don’t need any other columns. So fitting these few columns into a Pandas dataframe and doing it in sklearn should be fine for most cases. It’s hard to imagine this not working, even if the data is really, really large. If it somehow doesn’t fit, I don’t think you should plot the ROC curve.
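For what it’s worth, a minimal sketch of that approach might look like the following. It assumes the labels and probabilities are available as lazy dask collections; the helper name `auc_in_memory` and the synthetic data are invented for the example. Dask arrays are used here, but `dask.compute` also accepts columns selected from a dask dataframe. As a rough size check, two 8-byte columns take about 16 bytes per row, so 10 million rows come to roughly 160 MB.

```python
import dask
import dask.array as da
import numpy as np
from sklearn.metrics import roc_auc_score


def auc_in_memory(y_true, y_score):
    """Pull just the two needed columns into local memory, then score with sklearn.

    Hypothetical helper: y_true and y_score are assumed to be dask collections.
    """
    # A single dask.compute call evaluates both collections together.
    y_true_local, y_score_local = dask.compute(y_true, y_score)
    return roc_auc_score(y_true_local, y_score_local)


# Synthetic stand-in for the output of a dask-ml prediction step.
rng = np.random.default_rng(0)
labels_np = rng.integers(0, 2, size=100_000)
scores_np = np.clip(0.3 * labels_np + rng.normal(0.4, 0.2, size=labels_np.size), 0.0, 1.0)
y_actual_label = da.from_array(labels_np, chunks=10_000)
y_predicted_probability = da.from_array(scores_np, chunks=10_000)

auc_value = auc_in_memory(y_actual_label, y_predicted_probability)
print(f"AUC: {auc_value:.4f}")
```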

Hi Niu,

Sorry it’s been a few days; I was caught up with work. I appreciate your response! It sounds like I was overthinking the problem and assuming everything had to be done in dask and be distributed. What you have said sounds reasonable to me.

Thank you for your help!

For anyone else wondering the same thing: I forgot to revisit this, but this approach has worked for me even with a dataset of roughly 9 million rows. Marking as answered now, thanks again for the help @niu!!!