Lorenz cuvres and Gini coefficients which are typically used as summary measures of inequality, are ubiquitous in development economics. A closely-related measure, the Receiver Operating Characteristic [ROC] curve is a commonly used visual model selection tool in classification problems. In a recent conversation with the Dismal Blogger, some issues of implementation came up and since I knew nothing about these measures, I decided to look them up and see if I could figure out how they are used. In this post, I mainly discuss the ROC curve.
ROC
Predictions from a model for binary data can be summarised neatly using the so-called confusion matrix. Let the outcome of interest be $y \in \{0, 1\}$ and predictions from the model be $\hat{y}$, then we have the following 2x2 contingency table:
The ROC is a fairly intuitive summary of the cost of acquiring true positives in terms of the increase in false positives, as it plots the True Positive Rate [TPR] $\left(=\dfrac{TP}{TP+FN}\right)$ against the False Positive Rate [FPR] $\left(=\dfrac{FP}{FP+TN}\right)$ for changing values of the criteria of discrimination (to be explained below). The first term is also known as the sensitivity and the second term is $1-$specificity. The idea being that as the cost (FPR) increases, the benefits (TPR) should dominate it.
To construct these numbers, we need predicted values of the outcomes from our model, $\hat{y}_i$, where $i \in {1, \dots, N}$ is the observation index, from a sample of N individuals. Typically, models for binary data, like the logit, do not produce binary predictions, but predicted probabilities of positives, i.e., $\hat{p}_i = \widehat{Prob}(y_i=1|\mathbf{x}_i)$, where $\mathbf{x}_i$ is a vector of regressors. Therefore, a binning criteria is necessary, say, $\hat{y}_i = \mathbf{1}(\hat{p}_i> c)$ for some $c \in [0, 1]$. Allowing the binning threshold to vary in its domain allows us to estimate an ROC curve for a particular model. For $c=0$, TPR = 1 and FPR = 1, so (1, 1) is a point on the curve, and for $c=1$ both TPR and FPR are zero (all predicted outcomes are zero), thus, (0,0) is a point on the curve. Note that the the entire ROC curve is contained in the unit square.
Area Under Curve [AUC]
While the entire ROC curve, for various values of the threshold parameter, can be used to assess the fitted model (more on this in a later post), single-number summaries are popular, and one such summary number is the area under the ROC curve. The interpretation is that it is the probability with which an arbitrarily chosen positive will be classified (by the model) as having higher probability than an arbitrarily chosen negative point.
This number can be estimated by numerical integration, which due to the irregular nature of the estimated curve, is typically done by the method of trapezoids (I knew high-school math was going to come in handy one day!). Stata's default numerical integration algorithm is by fitting cubic splines to the estimated curve which better for smoother curves.
An example
Here is an example of how the ROC curve is constructed by hand in Stata using the Hosmer & Lemeshow dataset. This is not remarkable in itself, but hopefully be a building block for more complicated examples. I draw the ROC curve based on a logistic fit and compute the area under the ROC curve using numerical integration:
Then, the ROC curves look like this:
Note that in the calculation of the integral, Stata only uses 79 points due to the one-many nature of the estimated function (I think).
Miscellaneous points
There were several other points I wanted to cover in this post but will have to postpone for lack of time, including,
PS. I'd like to point out again that I go to some lengths to ensure that my code runs out-of-the-box, and so is, in marketing parlance, free to try!
ROC
Predictions from a model for binary data can be summarised neatly using the so-called confusion matrix. Let the outcome of interest be $y \in \{0, 1\}$ and predictions from the model be $\hat{y}$, then we have the following 2x2 contingency table:
$\begin{tabular}{|c||c c|}
\hline
& \begin{math}y=1\end{math}&\begin{math}y=0\end{math} \\ \hline
\begin{math}\hat{y}=1\end{math}&True positive [TP]&False positive [FP]\\
\begin{math}\hat{y}=0\end{math}&False negative [FN]&True negative [TN] \\
\hline
\end{tabular} $
\hline
& \begin{math}y=1\end{math}&\begin{math}y=0\end{math} \\ \hline
\begin{math}\hat{y}=1\end{math}&True positive [TP]&False positive [FP]\\
\begin{math}\hat{y}=0\end{math}&False negative [FN]&True negative [TN] \\
\hline
\end{tabular} $
The ROC is a fairly intuitive summary of the cost of acquiring true positives in terms of the increase in false positives, as it plots the True Positive Rate [TPR] $\left(=\dfrac{TP}{TP+FN}\right)$ against the False Positive Rate [FPR] $\left(=\dfrac{FP}{FP+TN}\right)$ for changing values of the criteria of discrimination (to be explained below). The first term is also known as the sensitivity and the second term is $1-$specificity. The idea being that as the cost (FPR) increases, the benefits (TPR) should dominate it.
To construct these numbers, we need predicted values of the outcomes from our model, $\hat{y}_i$, where $i \in {1, \dots, N}$ is the observation index, from a sample of N individuals. Typically, models for binary data, like the logit, do not produce binary predictions, but predicted probabilities of positives, i.e., $\hat{p}_i = \widehat{Prob}(y_i=1|\mathbf{x}_i)$, where $\mathbf{x}_i$ is a vector of regressors. Therefore, a binning criteria is necessary, say, $\hat{y}_i = \mathbf{1}(\hat{p}_i> c)$ for some $c \in [0, 1]$. Allowing the binning threshold to vary in its domain allows us to estimate an ROC curve for a particular model. For $c=0$, TPR = 1 and FPR = 1, so (1, 1) is a point on the curve, and for $c=1$ both TPR and FPR are zero (all predicted outcomes are zero), thus, (0,0) is a point on the curve. Note that the the entire ROC curve is contained in the unit square.
Area Under Curve [AUC]
While the entire ROC curve, for various values of the threshold parameter, can be used to assess the fitted model (more on this in a later post), single-number summaries are popular, and one such summary number is the area under the ROC curve. The interpretation is that it is the probability with which an arbitrarily chosen positive will be classified (by the model) as having higher probability than an arbitrarily chosen negative point.
This number can be estimated by numerical integration, which due to the irregular nature of the estimated curve, is typically done by the method of trapezoids (I knew high-school math was going to come in handy one day!). Stata's default numerical integration algorithm is by fitting cubic splines to the estimated curve which better for smoother curves.
An example
Here is an example of how the ROC curve is constructed by hand in Stata using the Hosmer & Lemeshow dataset. This is not remarkable in itself, but hopefully be a building block for more complicated examples. I draw the ROC curve based on a logistic fit and compute the area under the ROC curve using numerical integration:
/********************************************* * Reconstruct the ROC curve in Stata **********************************************/ clear* webuse lbw tab low logistic low age lwt i.race smoke ptl ht ui // fit a logistic model predict prob_low, pr // predcited probabilities cap matrix drop roc // calculate the ROC curve using the percentiles as thresholds forv i=1(1)99 { cap drop pred_low cap drop sens cap drop spec _pctile prob_low, nq(100) local cutoff = `r(r`i')' qui { g pred_low = (prob_low >= `cutoff') g sens = (pred_low == 1 & low==1) // true positive g spec = (pred_low == 0 & low==0) // true negative count if sens local sens_mean = `r(N)' count if low // positives local sens_mean = `sens_mean' / `r(N)' count if spec local spec_mean = `r(N)' count if !low // negatives local spec_mean = `spec_mean' / `r(N)' local sum =`sens_mean' + `spec_mean' } mat roc = (nullmat(roc)\ (`sens_mean', 1-`spec_mean', `sum')) } mat roc = (1, 1, .)\ roc mat roc = roc \ (0, 0, .) cap erase roc_mat.dta svmatf , mat(roc) fil("roc_mat.dta") // save the matrix of ROC points preserve use roc_mat, clear integ c1 c2, trapezoid // calculate the area under the ROC curve using the trapezoidal rule local integ_trap =`r(integral)' integ c1 c2 // calculate the area under the ROC curve using cubic splines local integ_cub =`r(integral)' twoway line c1 c2, xtitle("1-specificity") ytitle("Sensitivity") /// note("The area under the curve is estimated as: " /// "`integ_trap' (trapezoidal) and `integ_cub' (cubic spline)") /// xla(0(.25)1, grid) yla(0(.25)1) /// title("ROC curve for estimated model") subtitle("Manual calculation") gr export roc.png, replace restore logistic low age lwt i.race smoke ptl ht ui lroc, title("ROC curve for estimated model") recast(line) /// subtitle("Stata's calculation") /// lpattern(solid) // calculate the area under the ROC curve using built-in function gr export lroc.png, replace
Then, the ROC curves look like this:
Note that in the calculation of the integral, Stata only uses 79 points due to the one-many nature of the estimated function (I think).
Miscellaneous points
There were several other points I wanted to cover in this post but will have to postpone for lack of time, including,
- the drawbacks of using the AUC, as covered here;
- the relationship of the AUC measure to the Gini coefficient as given here;
- and the Lorenz curve as the rotated ROC about the 45 degree line of symmetry.
PS. I'd like to point out again that I go to some lengths to ensure that my code runs out-of-the-box, and so is, in marketing parlance, free to try!
2 comments:
Excellent!.. Just what I was reading about.. Some questions I was looking at was
Is the value of the Gini independent of the number of bucket we break the data into and more importantly, is it independent of the empirical prob of the event in the data set?
I found some material regarding this.. need to discuss.
PS. the formulae don't show up properly in reader.. are you aware of that?
Yes, I remember the question. I am slowly (as usual) trying to get a grasp of the basics. Eventually, I will try to address the question. But next post I think I will discuss construction of the Lorenz curve and Gini coefficients and their relationship to the ROC.
I want to think carefully about their interpretations. But do send whatever you have found over. Will try to incorporate in next post.
Btw, did you note bit about trapezoidal integration? :-)
PS. Not much I can do about the formulae in Reader. The formulae parse when the Javascript recognises that the event "onload" has happened, and that is not triggered when the page is accessed via Reader.
[Krull]
Post a Comment