Friday, September 10, 2010

Loose Ends

So the dust has settled a little bit (for those who don’t know this, I have moved from here) and I find myself in possession of an incredible piece of hardware (no pun intended). I will also never underestimate Americans when they talk about size: they said it was big, and it is huge. And I have owned some large laptops in my time. Hence this (almost) celebratory blog post.

~*~

I have known the Dismal Blogger a long time and believe I know him well. Yet, over the last year, he has surprised me with his generosity and warmth. I thank him from the bottom of my heart for this.

~*~

Partly because of this gratitude, I believe I should find an alternative home for my often arcane material. The readers of this blog, carefully collected by the Dismal Blogger over the course of many years – discerning people all – have spoken out unequivocally against the hijacking of this blog, and I could not, as a statistician, ignore that evidence. I haven’t found a new home yet, but once I do, I will make it known on this blog. In cyberspace as in real life, I thank the Dismal Blogger for being so generous with his blog.

~*~

Following advice here, I am using Windows Live Writer to compose this blog post. And it is nice!

Saturday, July 17, 2010

Is that intuition, or are you just happy to see me?

Some intellectual wrangling with the Dismal Blogger resulted in our reaching a well-known (at least to me) impasse: what is the role of intuition and simplicity in science and mathematics, and is it antithetical to rigour and complexity? Despite my ardent desire to learn, I am poorly versed in methodological questions. Therefore, I supply my favourite quotation on this issue. Obviously, being a quotation, it doesn't amount to much, but it is generally how I feel about the rôle of rigour.

The rôles of rigor and intuition are subject to misconceptions. As was pointed out in Volume 1, natural intuition and natural thinking are a poor affair, but they gain strength with the development of mathematical theory. Today's intuition and applications depend on the most sophisticated theories of yesterday. Furthermore, strict theory represents economy of thought rather than luxury.
- William Feller, An Introduction to Probability Theory and Its Applications, Volume II (2nd edition, 1971), p. 3, footnote 4
In the pantheon of extraordinary textbook writers, Feller is second to none, cramming in as he does relevant example after relevant example, but never giving in to the desire to jump to conclusions based on examples - preferring to work through to confirm or deny his intuition, and using his results to inform his intuition. This last point is important, because if, like the American astronauts, we were to wait for all problems to be solved rigorously before we ever did statistics, we'd be here a while. The part played by experience and common sense in settings where cost, scale and speed are issues is invaluable. Just yesterday, Andrew Gelman, over on his blog, dredged up this quote:
Statisticians are particularly sensitive to default settings, which makes sense considering that statistics is, in many ways, a science based on defaults. What is a "statistical method" if not a recommended default analysis, backed up by some combination of theory and experience?
I can understand that practitioners would typically choose to let the theorists battle it out amongst themselves for a decade or two to figure out the relative merits of techniques, and only then move in to gather up the spoils of war, i.e., the most widely applicable/median/default method. But I fear this is sub-optimal. There is little comfort in something that often works but fails when you most need it to work (e.g., macroeconomics).
On a related note, I am a little uncomfortable with the phrase "complexity for complexity's sake". There is, in my opinion, no such thing. Complexity relates to how you “do” science. Proofs can be elegant or workmanlike, and algorithms & techniques can be crude or sophisticated, and you can choose one or the other based on your preferences. But the actual content of science, its theorems, is exactly that: little bits of value-neutral truth which give you the power to discriminate. And just like John Locke:
To love truth for truth's sake is the principal part of human perfection in this world, and the seed-plot of all other virtues.
so we too claim.
Is all rigour a good thing? Obviously not. But useless rigour is easily seen for what it is (if assessed by a trained person) and cast aside. It is not so easy to cast aside rules of thumb, traditions or defaults, because, almost by definition, these things are hard to pin down.
Further reading
  1. Counterexamples in Probability And Statistics
  2. Gödel's Proof (chapter II)

Thursday, July 15, 2010

Reconstructing aspects of the Receiver Operating Characteristic (ROC) curve in Stata

Lorenz curves and Gini coefficients, which are typically used as summary measures of inequality, are ubiquitous in development economics. A closely related measure, the Receiver Operating Characteristic [ROC] curve, is a commonly used visual model-selection tool in classification problems. In a recent conversation with the Dismal Blogger, some issues of implementation came up, and since I knew nothing about these measures, I decided to look them up and see if I could figure out how they are used. In this post, I mainly discuss the ROC curve.


ROC

Predictions from a model for binary data can be summarised neatly using the so-called confusion matrix. Let the outcome of interest be $y \in \{0, 1\}$ and the prediction from the model be $\hat{y}$; then we have the following 2x2 contingency table:

$\begin{tabular}{|c||c c|}
\hline
& \begin{math}y=1\end{math}&\begin{math}y=0\end{math} \\ \hline
\begin{math}\hat{y}=1\end{math}&True positive [TP]&False positive [FP]\\
\begin{math}\hat{y}=0\end{math}&False negative [FN]&True negative [TN] \\
\hline
\end{tabular} $

The ROC curve is a fairly intuitive summary of the cost of acquiring true positives in terms of the increase in false positives, as it plots the True Positive Rate [TPR] $\left(=\dfrac{TP}{TP+FN}\right)$ against the False Positive Rate [FPR] $\left(=\dfrac{FP}{FP+TN}\right)$ for changing values of the criterion of discrimination (to be explained below). The first term is also known as the sensitivity, and the second term is $1-$specificity. The idea is that as the cost (FPR) increases, the benefit (TPR) should dominate it.

To construct these numbers, we need predicted values of the outcomes from our model, $\hat{y}_i$, where $i \in \{1, \dots, N\}$ is the observation index, from a sample of N individuals. Typically, models for binary data, like the logit, do not produce binary predictions, but predicted probabilities of positives, i.e., $\hat{p}_i = \widehat{Prob}(y_i=1|\mathbf{x}_i)$, where $\mathbf{x}_i$ is a vector of regressors. Therefore, a binning criterion is necessary, say, $\hat{y}_i = \mathbf{1}(\hat{p}_i> c)$ for some $c \in [0, 1]$. Letting the binning threshold vary over its domain traces out an estimated ROC curve for a particular model. For $c=0$, TPR = 1 and FPR = 1, so (1, 1) is a point on the curve, and for $c=1$ both TPR and FPR are zero (all predicted outcomes are zero), so (0, 0) is a point on the curve. Note that the entire ROC curve is contained in the unit square.
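To make the mapping from a single threshold to the confusion matrix concrete, here is a minimal sketch (it borrows the lbw data and the logistic fit from the full example further below; the cutoff of 0.5 is arbitrary and purely illustrative):

webuse lbw, clear
logistic low age lwt i.race smoke ptl ht ui  // same model as in the example below
predict prob_low, pr  // predicted probabilities
g byte pred_low = (prob_low >= 0.5)  // binning rule with c = 0.5
tab pred_low low  // rows: y-hat, columns: y, i.e., the 2x2 table above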


Area Under Curve [AUC]

While the entire ROC curve, for various values of the threshold parameter, can be used to assess the fitted model (more on this in a later post), single-number summaries are popular, and one such summary is the area under the ROC curve. Its interpretation is the probability that an arbitrarily chosen positive observation is assigned a higher predicted probability by the model than an arbitrarily chosen negative observation.
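Stated as a formula (just a restatement of the sentence above, in the notation of the previous section, with $i$ indexing a positive and $j$ a negative observation): \[AUC = Prob\left(\hat{p}_i > \hat{p}_j \mid y_i = 1, y_j = 0\right)\]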

This number can be estimated by numerical integration, which, due to the irregular nature of the estimated curve, is typically done by the method of trapezoids (I knew high-school math was going to come in handy one day!). Stata's default numerical integration algorithm instead fits cubic splines to the estimated curve, which works better for smoother curves.
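To be concrete, if the estimated curve is traced out by the points $(FPR_k, TPR_k)$, $k = 1, \dots, K$, ordered by increasing FPR, the trapezoidal estimate of the area is \[\widehat{AUC} = \sum_{k=1}^{K-1}\frac{1}{2}\left(FPR_{k+1}-FPR_k\right)\left(TPR_{k+1}+TPR_k\right)\]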


An example

Here is an example of how the ROC curve is constructed by hand in Stata using the Hosmer & Lemeshow dataset. This is not remarkable in itself, but will hopefully serve as a building block for more complicated examples. I draw the ROC curve based on a logistic fit and compute the area under the ROC curve using numerical integration:

/*********************************************
* Reconstruct the ROC curve in Stata
**********************************************/
clear *
webuse lbw

tab low
logistic low age lwt i.race smoke ptl ht ui  // fit a logistic model
predict prob_low, pr  // predicted probabilities
cap matrix drop roc
// calculate the ROC curve using the percentiles as thresholds
forv i=1(1)99 {
 cap drop pred_low
 cap drop sens
 cap drop spec
 _pctile prob_low, nq(100)  // percentiles of prob_low (re-run each pass since -count- below overwrites r())
 local cutoff = `r(r`i')'
 qui {
  g pred_low = (prob_low >= `cutoff')
  g sens = (pred_low == 1 & low==1)  // true positive
  g spec = (pred_low == 0 & low==0)  // true negative
  count if sens
  local sens_mean = `r(N)'
  count if low  // positives
  local sens_mean = `sens_mean' / `r(N)'
  count if spec  
  local spec_mean = `r(N)'
  count if !low  // negatives
  local spec_mean = `spec_mean' / `r(N)'
  local sum =`sens_mean' + `spec_mean'
 }
 mat roc = (nullmat(roc)\ (`sens_mean', 1-`spec_mean', `sum'))
}
mat roc = (1, 1, .)\ roc
mat roc = roc \ (0, 0, .)
cap erase roc_mat.dta
svmatf , mat(roc) fil("roc_mat.dta")  // save the matrix of ROC points (svmatf is a user-written command)

preserve
use roc_mat, clear
integ c1 c2, trapezoid  // calculate the area under the ROC curve using the trapezoidal rule
local integ_trap =`r(integral)'
integ c1 c2    // calculate the area under the ROC curve using cubic splines
local integ_cub =`r(integral)'
twoway line c1 c2, xtitle("1-specificity") ytitle("Sensitivity") ///
 note("The area under the curve is estimated as: " ///
 "`integ_trap' (trapezoidal) and `integ_cub' (cubic spline)") ///
 xla(0(.25)1, grid) yla(0(.25)1) ///
 title("ROC curve for estimated model") subtitle("Manual calculation")
gr export roc.png, replace 
restore

logistic low age lwt i.race smoke ptl ht ui
lroc, title("ROC curve for estimated model") recast(line) ///
 subtitle("Stata's calculation") ///
 lpattern(solid)  // calculate the area under the ROC curve using built-in function
gr export lroc.png, replace

Then, the ROC curves look like this:
[Figures: the manually constructed ROC curve (roc.png, "Manual calculation") and Stata's built-in version (lroc.png, "Stata's calculation")]
Note that in the calculation of the integral, Stata only uses 79 points, due (I think) to the one-to-many nature of the estimated function.
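As a quick sanity check, the manual trapezoidal number can be compared against Stata's own calculation: if I remember correctly, -lroc- leaves its area estimate behind in r(area), so adding the following two lines immediately after the -lroc- call above should print the two numbers side by side:

di "Manual AUC (trapezoid): " `integ_trap'
di "AUC from -lroc-:        " r(area)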


Miscellaneous points

There were several other points I wanted to cover in this post but will have to postpone for lack of time, including:
  1. the drawbacks of using the AUC, as covered here;
  2. the relationship of the AUC measure to the Gini coefficient as given here (a one-line version appears just after this list);
  3. and the Lorenz curve as the rotated ROC about the 45 degree line of symmetry.
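If I have the relationship in point 2 right, it is the simple linear one usually quoted in the classification literature (with the Gini index defined with respect to the ROC curve rather than the Lorenz curve): \[Gini = 2\cdot AUC - 1\]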


PS. I'd like to point out again that I go to some lengths to ensure that my code runs out-of-the-box, and so is, in marketing parlance, free to try!

Monday, July 5, 2010

Fixed format data in Stata

I have never needed to use fixed-format ASCII data in Stata; I typically work with CSV files. However, I was looking for data to replicate this paper by Bill Greene on a general method to incorporate selectivity into limited dependent variable models.
I was about to write to him for the data, but decided to take a quick look through his website to see if he did provide the data. Turns out he provides a subset of that dataset, Table F25.1: Expenditure and Default Data, 1319 observations, as part of the example datasets of the 6th edition of his massively bestselling Econometric Analysis.
The small problem is that the dataset is a fixed-format text file, probably formatted as an Nlogit/Limdep dataset, and reading it into Stata is not straightforward.
So, I decided to figure out how to write a dictionary file and read that data in using Stata's -infile- command. The dictionary file looks like this:

dictionary {
 _first(4)  * first line of data is the fourth
 _lines(3)  * there are three lines of data per observation
 
 _line(1)   * begin with line one of each observation
 Cardhldr "Dummy variable, 1 if application for credit card accepted, 0 if not"
 Majordrg "Number of major derogatory reports"
 Age  "Age n years plus twelfths of a year"
 Income  "Yearly income (divided by 10,000)"
 Exp_Inc  "Ratio of monthly credit card expenditure to yearly income"
 
 _newline   * move to the next line of an observation
 Avgexp  "Average monthly credit card expenditure"
 Ownrent  "1 if owns their home, 0 if rent"
 Selfempl "1 if self employed, 0 if not."
 Depndt  "1 + number of dependents"
 Inc_per  "Income divided by number of dependents"
 
 _newline  * move to the next (last) line of an observation
 Cur_add  "months living at current address"
 Major  "number of major credit cards held"
 Active  "number of active credit accounts"
}
Save this file as "limdep2stata.dct". Then, this dictionary file can be used to read in the data using a do-file which looks like this:
/*
* Read in LIMDEP data in Stata
*/
infile using limdep2stata.dct, using(TableF25-1.txt) clear 
renvars _all, lower
drop if cardhldr==.  // one extra line read in 
I guess the data file lacks an end-of-file delimiter and so an extra line is read in before Stata figures out that the file has ended. I will see if there is a simple solution to avoid this. But it does no harm and the extra line is easily dropped.
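For what it's worth, here is the minimal sanity check I'd run after an import like this (nothing Limdep-specific; it just confirms that the 1319 observations promised in Table F25.1 actually arrived and that the key variables look sensible):

describe
count  // should report 1,319 observations
assert _N == 1319
summarize avgexp income exp_inc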
And that's it! You are good to go.
PS. I must mention that the command -renvars- is due to Nick Cox and Jeroen Weesie.

Sunday, July 4, 2010

Blog clog

No sooner had I fixed the last problem with this blog than I returned home today to discover that Alex Gorbatchev, who writes and maintains SyntaxHighlighter [SH], which I use to highlight the code on this blog, had made a major revision, up to ver. 3.0.83, and, guess what, it broke all the highlighting of my previous R code, even though the code highlighted by other brushes looked fine. My R code looked like a dog's breakfast. Since the R brush was written by someone else, I assumed that the changes to SH had made the brush incompatible with the new SH. I looked at the changes in the new version and decided, on balance, that I needed the R brush fixed more than I needed the new features of SH. So I set about fixing the mess.
  1. I decided to download the last version of SH (where all my codes rendered properly) and host it myself. Previously, I was using the versions Alex was hosting on his server space and which he periodically updated with new versions.
  2. Since most of SH is JavaScript [JS], and I have had enough of hosting JS at random places on the internet (most uploading sites do not accept JS, so you need to find specific places that will), I decided to muck around with hosting it on a Google Code project. I re-used the project I had created to host the JS of the $LaTeX$ renderer, and decided to wait until I had figured everything out before creating a separate code project to host SH.
  3. This allowed me to do something I have always wanted to figure out - see how version control software works. I had installed Tortoise SVN eons ago but never used it. It was pretty simple as I just followed the instructions here.
  4. The main thing was to figure out how to set mime-types for the various files, whether CSS or JS. I learnt how to do that here.
  5. Finally, I replaced the Blogger HTML template with the new links to the JS files associated with SH.
Then I spent half-a-day scratching my head because after checking everything, it would still not work (the R code rendered with visible errors). After much staring at the screen, I realised that the old SH versions were arranged chronologically rather than reverse-chronologically as I had expected and I had downloaded the oldest available minor revision of release 2! Quickly rectified and sanity was restored!
All that now remains to be done is to migrate all the SH code to a new code project which will contain all the JS associated with this blog, including the $LaTeX$ renderer. Then I can be free of the whims and fancies of developers and servers!

Friday, July 2, 2010

Javascript, HTML, jQuery and the Milk of Human Kindness

Some of you who read this blog (optimistically speaking) will have noticed the ugliness which was perpetrated on this previous post ever since I put up this other post. This was largely because the $LaTeX$ parser I use to display math was also rendering certain bits of R code containing the dollar character as $LaTeX$.

Since then I have had a great if sometimes frustrating time googling and querying the developer forums and have learnt a lot about how HTML, Javascript and jQuery work. Thank you in particular to the anon. moderator on the webdeveloper forums who spent quite a bit of time helping me through this.
The trick is to define and send the contents of a user-defined div class to the parser on load. This involves setting up your Blogger "HTML/Javascript" widget to contain the Javascript:


and then include everything you want the parser to parse as in the following example:
$\frac{1}{2}$

Wednesday, June 30, 2010

Unbiasedness and 2 counterexamples

The role of unbiasedness in statistical decision theory is ambiguous. Of the various criteria being considered here, it is the only one that does not depend solely on the risk function. Often we find that biased estimators perform better than unbiased ones from the point of view of, say, minimising the mean squared error. For this reason, many modern statisticians consider the whole concept of unbiasedness to be somewhere between a distraction and a total irrelevance.

- G.A. Young and R.L. Smith, Essentials of Statistical Inference, CUP (2005), p. 10

I will supply two examples - one innocuous and one dramatic - to illustrate the problems with unbiasedness as a criterion for selecting estimators. These are both taken from here, but the second is originally from here.

  1. Consider the problem of estimating the variance of observations $\{X_i\}_{i=1}^{n}$ drawn from a Gaussian density with mean $\mu$ and variance $\sigma^2$. Consider the estimators \[s^2 =\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2\] and \[t^2 =\frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2\] for $\sigma^2$, where $\bar{x} = \frac{\sum_{i=1}^n x_i}{n}$. $s^2$ is often insisted upon, as it is claimed to be better due to its unbiasedness. However, on the more relevant criterion of mean squared error (what good is an unbiased estimator for a parameter with true value zero (say) if it is -100 half the time and 100 the other half? I'd rather have the biased estimator that is +0.0001 with probability 0.6 and -0.0001 with probability 0.4), $t^2$ is superior in this case, as \[\mathbb{E}(t^2-\sigma^2)^2 < \mathbb{E}(s^2-\sigma^2)^2\] (a sketch of the calculation follows this list).
  2. The difference between the two estimators above diminishes as the sample size increases, and they converge to the same probability limit, $\sigma^2$; thus both are consistent. A remarkable situation where this is not the case is when $X \sim Poisson(\lambda)$, $\lambda \in \mathbb{R}^+$. An unbiased estimator $\delta(X)$ for $e^{-2\lambda}$ satisfies the requirement \[\mathbb{E}(\delta(X)) = \sum_{x=0}^{\infty}\delta(x)\frac{e^{-\lambda}\lambda^x}{x!} = e^{-2\lambda},\] where $\frac{e^{-\lambda}\lambda^x}{x!}$ is the Poisson mass function.
    But from the power series expansion of $e^{-\lambda}$ we know that the only possible such function $\delta(X)$ is $(-1)^{X}$. Then if $X=100$, we are led to estimate the parameter $e^{-2\lambda}$ as 1, whereas $\lambda$ would need to be fairly high to produce a realisation of 100 (so the true $e^{-2\lambda}$ is essentially zero). And even more incredibly, if $X=3$, then $\hat{\delta} = -1$, an estimate for a parameter which must lie in $(0, 1]$. A better estimator for $e^{-2\lambda}$ is $e^{-2X}$, which is, in this case, the maximum likelihood estimator.
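For completeness, here is my own working behind these two claims (nothing beyond the standard result $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ under the Gaussian assumption, plus some coefficient matching for the Poisson case). For the first example, \[\mathbb{E}(s^2-\sigma^2)^2 = \mathbb{V}(s^2) = \frac{2\sigma^4}{n-1}, \qquad \mathbb{E}(t^2-\sigma^2)^2 = \mathbb{V}(t^2) + \left[\mathbb{E}(t^2)-\sigma^2\right]^2 = \frac{2(n-1)\sigma^4}{n^2} + \frac{\sigma^4}{n^2} = \frac{(2n-1)\sigma^4}{n^2},\] and $\frac{2n-1}{n^2} < \frac{2}{n-1}$ for all $n \geq 2$, so $t^2$ has the smaller mean squared error. For the second example, dividing the unbiasedness requirement through by $e^{-\lambda}$ gives $\sum_{x=0}^{\infty}\delta(x)\frac{\lambda^x}{x!} = e^{-\lambda} = \sum_{x=0}^{\infty}\frac{(-\lambda)^x}{x!}$, and since this must hold for every $\lambda > 0$, matching the coefficients of $\lambda^x$ forces $\delta(x) = (-1)^x$.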