Lorenz curves and Gini coefficients, which are typically used as summary measures of inequality, are ubiquitous in development economics. A closely related measure, the Receiver Operating Characteristic [ROC] curve, is a commonly used visual model-selection tool in classification problems. In a recent conversation with the Dismal Blogger, some issues of implementation came up, and since I knew nothing about these measures, I decided to look them up and see if I could figure out how they are used. In this post, I mainly discuss the ROC curve.
ROC
Predictions from a model for binary data can be summarised neatly using the so-called confusion matrix. Let the outcome of interest be $y \in \{0, 1\}$ and the predictions from the model be $\hat{y}$; then we have the following 2x2 contingency table:
$\begin{array}{c|cc}
 & y=1 & y=0 \\
\hline
\hat{y}=1 & \text{True positive [TP]} & \text{False positive [FP]} \\
\hat{y}=0 & \text{False negative [FN]} & \text{True negative [TN]}
\end{array}$
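As a minimal sketch of how such a table can be produced in Stata, assuming predicted probabilities are already in hand (the names phat and y, and the 0.5 cutoff, are hypothetical):

* Minimal sketch: confusion matrix at a hypothetical 0.5 cutoff;
* -phat- and -y- stand in for the predicted probabilities and the outcome
g yhat = (phat > 0.5) if !missing(phat) // guard: missing > 0.5 is true in Stata
tab yhat y // rows are predictions, columns are observed outcomes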
The ROC curve is a fairly intuitive summary of the cost of acquiring true positives in terms of the accompanying increase in false positives: it plots the True Positive Rate [TPR] $\left(=\dfrac{TP}{TP+FN}\right)$ against the False Positive Rate [FPR] $\left(=\dfrac{FP}{FP+TN}\right)$ for varying values of the discrimination criterion (explained below). The first quantity is also known as the sensitivity, and the second is $1-$specificity. The idea is that as the cost (FPR) increases, the benefit (TPR) should dominate it.
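For concreteness, with hypothetical counts TP = 60, FN = 40, FP = 20, and TN = 80, the corresponding point on the curve would be $TPR = \dfrac{60}{60+40} = 0.6$ and $FPR = \dfrac{20}{20+80} = 0.2$.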
To construct these numbers, we need predicted values of the outcomes from our model, $\hat{y}_i$, where $i \in \{1, \dots, N\}$ is the observation index in a sample of $N$ individuals. Typically, models for binary data, like the logit, do not produce binary predictions but predicted probabilities of positives, i.e., $\hat{p}_i = \widehat{Prob}(y_i=1|\mathbf{x}_i)$, where $\mathbf{x}_i$ is a vector of regressors. Therefore, a binning criterion is necessary, say, $\hat{y}_i = \mathbf{1}(\hat{p}_i> c)$ for some $c \in [0, 1]$. Letting the threshold $c$ vary over its domain traces out the ROC curve for a particular model. For $c=0$, TPR = 1 and FPR = 1 (all predicted outcomes are one), so (1, 1) is a point on the curve; for $c=1$, both TPR and FPR are zero (all predicted outcomes are zero), so (0, 0) is a point on the curve. Note that the entire ROC curve is contained in the unit square.
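As an aside, after a -logistic- fit Stata's built-in estat classification reports the sensitivity and specificity implied by a given cutoff, which is handy for spot-checking particular values of $c$ (the cutoff values below are arbitrary):

* Spot-check sensitivity and specificity at arbitrary cutoffs after -logistic-
estat classification, cutoff(.1) // low c: nearly everything classified positive
estat classification, cutoff(.9) // high c: nearly everything classified negative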
Area Under the Curve [AUC]
While the entire ROC curve, traced over values of the threshold parameter, can be used to assess the fitted model (more on this in a later post), single-number summaries are popular, and one such summary is the area under the ROC curve. Its interpretation: the AUC is the probability that the model assigns a higher predicted probability to a randomly chosen positive than to a randomly chosen negative.
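This interpretation also suggests a way to estimate the AUC from ranks, without any integration, via the Mann-Whitney statistic. A sketch, again with hypothetical names phat and y:

* Rank-based (Mann-Whitney) estimate of the AUC;
* -phat- and -y- are hypothetical variable names
egen double rk = rank(phat) // ties receive averaged (mid-)ranks
qui count if y
local n1 = r(N) // number of positives
qui count if !y
local n0 = r(N) // number of negatives
qui su rk if y, meanonly
di "AUC = " (r(sum) - `n1'*(`n1'+1)/2) / (`n1'*`n0')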
This number can be estimated by numerical integration which, given the irregular nature of the estimated curve, is typically done by the method of trapezoids (I knew high-school math was going to come in handy one day!). Stata's default numerical integration algorithm instead fits cubic splines to the estimated curve, which works better for smoother curves.
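For reference, the trapezoidal rule simply sums the areas of the trapezoids formed by consecutive points on the curve. A sketch, assuming variables tpr and fpr (hypothetical names) holding the curve's coordinates:

* Trapezoidal rule by hand; -tpr- and -fpr- are hypothetical variable names
sort fpr tpr
g double trap = 0.5*(tpr + tpr[_n-1])*(fpr - fpr[_n-1]) // area of each strip
qui su trap
di "AUC (trapezoids) = " r(sum)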
An example
Here is an example of how the ROC curve can be constructed by hand in Stata using the Hosmer & Lemeshow low-birthweight dataset. This is not remarkable in itself, but will hopefully serve as a building block for more complicated examples. I draw the ROC curve based on a logistic fit and compute the area under it using numerical integration:
/*********************************************
* Reconstruct the ROC curve in Stata
**********************************************/
clear *
webuse lbw
tab low
logistic low age lwt i.race smoke ptl ht ui // fit a logistic model
predict prob_low, pr // predicted probabilities
cap matrix drop roc
// trace out the ROC curve using the percentiles of prob_low as thresholds
forv i = 1(1)99 {
    cap drop pred_low
    cap drop sens
    cap drop spec
    // recompute the percentiles on each pass: the -count- calls below
    // overwrite the r() results that -_pctile- leaves behind
    _pctile prob_low, nq(100)
    local cutoff = `r(r`i')'
    qui {
        g pred_low = (prob_low >= `cutoff')
        g sens = (pred_low == 1 & low == 1) // true positive
        g spec = (pred_low == 0 & low == 0) // true negative
        count if sens
        local sens_mean = `r(N)'
        count if low // positives
        local sens_mean = `sens_mean' / `r(N)'
        count if spec
        local spec_mean = `r(N)'
        count if !low // negatives
        local spec_mean = `spec_mean' / `r(N)'
        local sum = `sens_mean' + `spec_mean'
    }
    // append (TPR, FPR, sensitivity + specificity) as a new row
    mat roc = (nullmat(roc) \ (`sens_mean', 1 - `spec_mean', `sum'))
}
mat roc = (1, 1, .) \ roc // prepend the c = 0 endpoint: everything classified positive
mat roc = roc \ (0, 0, .) // append the c = 1 endpoint: everything classified negative
cap erase roc_mat.dta
svmatf , mat(roc) fil("roc_mat.dta") // save the ROC points (svmatf is user-written; see -findit svmatf-)
preserve
use roc_mat, clear
integ c1 c2, trapezoid // calculate the area under the ROC curve using the trapezoidal rule
local integ_trap =`r(integral)'
integ c1 c2 // calculate the area under the ROC curve using cubic splines
local integ_cub =`r(integral)'
twoway line c1 c2, xtitle("1-specificity") ytitle("Sensitivity") ///
note("The area under the curve is estimated as: " ///
"`integ_trap' (trapezoidal) and `integ_cub' (cubic spline)") ///
xla(0(.25)1, grid) yla(0(.25)1) ///
title("ROC curve for estimated model") subtitle("Manual calculation")
gr export roc.png, replace
restore
logistic low age lwt i.race smoke ptl ht ui
lroc, title("ROC curve for estimated model") recast(line) ///
subtitle("Stata's calculation") ///
lpattern(solid) // calculate the area under the ROC curve using built-in function
gr export lroc.png, replace
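As a further cross-check, Stata's roctab computes the nonparametric (trapezoidal) AUC directly from the outcome and the predicted probabilities, so it should essentially match the trapezoidal estimate above:

* Cross-check: nonparametric (trapezoidal) AUC straight from the data
roctab low prob_low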
Then, the ROC curves look like this:
Note that in the calculation of the integral, Stata uses only 79 of the 101 points, presumably because the estimated curve is not one-to-one in the FPR direction (several points share the same FPR value), I think.
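If repeated FPR values are indeed the culprit, that is easy to check on the saved points:

* Check for repeated FPR values among the saved ROC points
use roc_mat, clear
duplicates report c2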
Miscellaneous points
There were several other points I wanted to cover in this post but will have to postpone for lack of time, including:
- the drawbacks of using the AUC, as covered here;
- the relationship of the AUC measure to the Gini coefficient as given here;
- and the Lorenz curve as the ROC curve rotated about the 45-degree line of symmetry.
PS. I'd like to point out again that I go to some lengths to ensure that my code runs out-of-the-box, and so is, in marketing parlance, free to try!