Thursday, July 15, 2010

Reconstructing aspects of the Receiver Operating Characteristic (ROC) curve in Stata

Lorenz curves and Gini coefficients, which are typically used as summary measures of inequality, are ubiquitous in development economics. A closely related measure, the Receiver Operating Characteristic [ROC] curve, is a commonly used visual model-selection tool in classification problems. In a recent conversation with the Dismal Blogger, some issues of implementation came up, and since I knew nothing about these measures, I decided to look them up and see if I could figure out how they are used. In this post, I mainly discuss the ROC curve.


ROC

Predictions from a model for binary data can be summarised neatly using the so-called confusion matrix. Let the outcome of interest be $y \in \{0, 1\}$ and predictions from the model be $\hat{y}$, then we have the following 2x2 contingency table:

$\begin{tabular}{|c||c c|}
\hline
& \begin{math}y=1\end{math}&\begin{math}y=0\end{math} \\ \hline
\begin{math}\hat{y}=1\end{math}&True positive [TP]&False positive [FP]\\
\begin{math}\hat{y}=0\end{math}&False negative [FN]&True negative [TN] \\
\hline
\end{tabular} $

The ROC curve is a fairly intuitive summary of the cost of acquiring true positives in terms of the increase in false positives: it plots the True Positive Rate [TPR] $\left(=\dfrac{TP}{TP+FN}\right)$ against the False Positive Rate [FPR] $\left(=\dfrac{FP}{FP+TN}\right)$ for changing values of the criterion of discrimination (explained below). The first term is also known as the sensitivity, and the second is $1-$specificity. The idea is that as the cost (FPR) increases, the benefit (TPR) should dominate it.

To construct these numbers, we need predicted values of the outcomes from our model, $\hat{y}_i$, where $i \in \{1, \dots, N\}$ is the observation index in a sample of $N$ individuals. Typically, models for binary data, like the logit, do not produce binary predictions but predicted probabilities of positives, i.e., $\hat{p}_i = \widehat{Prob}(y_i=1|\mathbf{x}_i)$, where $\mathbf{x}_i$ is a vector of regressors. Therefore, a binning criterion is necessary, say, $\hat{y}_i = \mathbf{1}(\hat{p}_i > c)$ for some $c \in [0, 1]$. Letting the threshold vary over its domain traces out an ROC curve for a particular model. For $c=0$, TPR = 1 and FPR = 1, so (1, 1) is a point on the curve; for $c=1$, both TPR and FPR are zero (all predicted outcomes are zero), so (0, 0) is a point on the curve. Note that the entire ROC curve is contained in the unit square.
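
As a minimal sketch (using hypothetical variable names: y for the observed outcome and phat for the predicted probabilities), the point on the curve for a single cutoff, say $c = 0.5$, can be computed along these lines:

g byte yhat = (phat > 0.5)  // classify at the cutoff c = 0.5
qui count if yhat == 1 & y == 1  // true positives
local tp = r(N)
qui count if y == 1  // all positives
local tpr = `tp' / r(N)  // TPR = TP/(TP+FN)
qui count if yhat == 1 & y == 0  // false positives
local fp = r(N)
qui count if y == 0  // all negatives
local fpr = `fp' / r(N)  // FPR = FP/(FP+TN)
di "TPR = `tpr'  FPR = `fpr'"

Repeating this over a grid of cutoffs and plotting TPR against FPR traces out the curve; the full example below does exactly that.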


Area Under the Curve [AUC]

While the entire ROC curve, traced over values of the threshold parameter, can be used to assess the fitted model (more on this in a later post), single-number summaries are popular, and one such summary is the area under the ROC curve. Its interpretation is that it is the probability that the model assigns a higher predicted probability to an arbitrarily chosen positive than to an arbitrarily chosen negative.
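
In symbols, with $N_1$ positives and $N_0$ negatives in the sample, this interpretation corresponds to the familiar pairwise (Mann-Whitney) estimator

$\widehat{AUC} = \dfrac{1}{N_1 N_0} \sum_{i:\, y_i=1} \sum_{j:\, y_j=0} \mathbf{1}(\hat{p}_i > \hat{p}_j),$

with tied pairs usually counted as one half.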

This number can be estimated by numerical integration, which, due to the irregular nature of the estimated curve, is typically done by the method of trapezoids (I knew high-school math was going to come in handy one day!). Stata's default numerical integration algorithm instead fits cubic splines to the estimated curve, which works better for smoother curves.
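
Concretely, if the estimated curve passes through the points $(x_0, y_0), \dots, (x_K, y_K)$, ordered by the FPR values $x_k$, the trapezoidal estimate is

$\widehat{AUC} \approx \sum_{k=1}^{K} \dfrac{(x_k - x_{k-1})(y_{k-1} + y_k)}{2}.$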


An example

Here is an example of how the ROC curve is constructed by hand in Stata using the Hosmer & Lemeshow low-birthweight dataset. This is not remarkable in itself, but it will hopefully be a building block for more complicated examples. I draw the ROC curve based on a logistic fit and compute the area under the ROC curve using numerical integration:

/*********************************************
* Reconstruct the ROC curve in Stata
**********************************************/
clear all
webuse lbw

tab low
logistic low age lwt i.race smoke ptl ht ui  // fit a logistic model
predict prob_low, pr  // predicted probabilities
cap matrix drop roc
// calculate the ROC curve using the percentiles as thresholds
forv i=1(1)99 {
 cap drop pred_low
 cap drop sens
 cap drop spec
 _pctile prob_low, nq(100)  // percentiles of the predicted probabilities (recomputed each pass since count below overwrites r())
 local cutoff = `r(r`i')'  // use the i-th percentile as this pass's cutoff
 qui {
  g pred_low = (prob_low >= `cutoff')
  g sens = (pred_low == 1 & low==1)  // true positive
  g spec = (pred_low == 0 & low==0)  // true negative
  count if sens  // number of true positives
  local sens_mean = `r(N)'
  count if low  // number of positives
  local sens_mean = `sens_mean' / `r(N)'  // sensitivity = TP/(TP+FN)
  count if spec  // number of true negatives
  local spec_mean = `r(N)'
  count if !low  // number of negatives
  local spec_mean = `spec_mean' / `r(N)'  // specificity = TN/(TN+FP)
  local sum = `sens_mean' + `spec_mean'  // sensitivity + specificity at this cutoff
 }
 mat roc = (nullmat(roc)\ (`sens_mean', 1-`spec_mean', `sum'))  // append (sensitivity, 1-specificity, sum) as a new row
}
mat roc = (1, 1, .)\ roc  // add the (1, 1) endpoint (c = 0)
mat roc = roc \ (0, 0, .)  // add the (0, 0) endpoint (c = 1)
cap erase roc_mat.dta
svmatf , mat(roc) fil("roc_mat.dta")  // save the matrix of ROC points (svmatf is user-written; findit svmatf to install)

preserve
use roc_mat, clear
integ c1 c2, trapezoid  // calculate the area under the ROC curve using the trapezoidal rule
local integ_trap =`r(integral)'
integ c1 c2    // calculate the area under the ROC curve using cubic splines
local integ_cub =`r(integral)'
twoway line c1 c2, xtitle("1-specificity") ytitle("Sensitivity") ///
 note("The area under the curve is estimated as: " ///
 "`integ_trap' (trapezoidal) and `integ_cub' (cubic spline)") ///
 xla(0(.25)1, grid) yla(0(.25)1) ///
 title("ROC curve for estimated model") subtitle("Manual calculation")
gr export roc.png, replace 
restore

logistic low age lwt i.race smoke ptl ht ui
lroc, title("ROC curve for estimated model") recast(line) ///
 subtitle("Stata's calculation") ///
 lpattern(solid)  // calculate the area under the ROC curve using built-in function
gr export lroc.png, replace

Then, the ROC curves look like this:

[roc.png: ROC curve for the estimated model (manual calculation)]
[lroc.png: ROC curve for the estimated model (Stata's calculation)]

Note that in the calculation of the integral, Stata uses only 79 points, due (I think) to the one-to-many nature of the estimated function.
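
As a quick cross-check, Stata's built-in roctab command computes the nonparametric trapezoidal area directly from the outcome and the predicted probabilities, using every distinct predicted value as a cutoff:

// assumes the lbw data and prob_low from the logistic fit above are still in memory
roctab low prob_low  // reports the ROC area with its standard error and confidence interval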


Miscellaneous points

There were several other points I wanted to cover in this post but will have to postpone for lack of time, including:
  1. the drawbacks of using the AUC, as covered here;
  2. the relationship of the AUC measure to the Gini coefficient, as given here;
  3. and the Lorenz curve as the ROC curve reflected about the 45-degree line of symmetry.


PS. I'd like to point out again that I go to some lengths to ensure that my code runs out-of-the-box, and so is, in marketing parlance, free to try!

2 comments:

The dismal blogger said...

Excellent!.. Just what I was reading about.. Some questions I was looking at were:

Is the value of the Gini independent of the number of buckets we break the data into, and, more importantly, is it independent of the empirical probability of the event in the dataset?

I found some material regarding this.. need to discuss.

PS. the formulae don't show up properly in reader.. are you aware of that?

Krull said...

Yes, I remember the question. I am slowly (as usual) trying to get a grasp of the basics. Eventually, I will try to address the question. But next post I think I will discuss construction of the Lorenz curve and Gini coefficients and their relationship to the ROC.

I want to think carefully about their interpretations. But do send whatever you have found over. Will try to incorporate in next post.

Btw, did you note the bit about trapezoidal integration? :-)

PS. Not much I can do about the formulae in Reader. The formulae parse when the Javascript recognises that the event "onload" has happened, and that is not triggered when the page is accessed via Reader.

[Krull]