Sunday, January 3, 2010

Summarise

With G's permission, I am hijacking this blog for a little bit of statistical computing blogging. Mostly basic stuff; feel free to ask loads of questions. Ask for code!
G and I had a few discussions about data cleaning and pre-processing. Although I tend to do this mostly by intuition (read: brute force), the first thing I always do (after labelling the data) is look at summary statistics. In particular, I look at the so-called five-number summary, part of the overarching scheme of John Tukey's exploratory data analysis: the sample minimum, the first quartile (25th percentile), the median, the third quartile (75th percentile) and the sample maximum. I tend to throw in the mean and the sample size as well. This output has been surprisingly useful to me for catching data recording errors and inconsistencies.
In R, this can be obtained with fivenum(x, na.rm = TRUE); in Stata, use John Gleason's -univar- (from SSC).
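For readers outside R and Stata, here is a minimal Python sketch of the same summary. Note that NumPy's percentiles use linear interpolation by default, which can differ slightly from the Tukey hinges that R's fivenum() computes; the function name below is just illustrative.

```python
import numpy as np

def five_number_summary(x):
    """Five-number summary (min, Q1, median, Q3, max) plus mean and n.

    Quartiles here use linear interpolation, so they may differ
    slightly from Tukey's hinges as returned by R's fivenum().
    """
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]  # drop missing values, like na.rm = TRUE
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {
        "min": x.min(), "q1": q1, "median": med,
        "q3": q3, "max": x.max(),
        "mean": x.mean(), "n": x.size,
    }

print(five_number_summary([1, 2, 3, 4, 100]))
```

A wild maximum (here, 100) jumps out immediately against the quartiles, which is exactly how recording errors tend to show up.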

2 comments:

The dismal blogger said...

The five-number summary or box-and-whisker plots are great for spotting errors, but I've also found counts to help a great deal. For example, if I know that some variable must always be positive, a count of how many observations are negative gives me the extent of the problem.
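The counting idea can be sketched in a couple of lines of Python (the data here are made up for illustration):

```python
import numpy as np

# Hypothetical variable that should always be positive.
x = np.array([3.2, -1.0, 5.4, -0.3, 7.1])

# Count how many observations violate the constraint,
# to gauge the extent of the problem rather than just its presence.
n_negative = int(np.sum(x < 0))
print(f"{n_negative} of {x.size} values are negative")
```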

The real question then is how one rectifies errors in the data. I don't know if there is any one universal technique, but I once faced a problem with a data set where total assets of firms were negative, for about 1% of the sample I was studying. I just removed these data points. Is there any other treatment you would suggest? Similarly, what do you do with outliers (a very common problem I've faced)?

Krull said...

Well, yes, but if you look at the sample minimum and see that it is negative when it should be >=0, you've got your catch.

In addition, I always suggest generous use of Stata's -assert-, which aborts with an error if a certain criterion is not met. So
. assert (var1>=0)
aborts with an error if this condition is not met. All crucial variables should be checked and double-checked this way.

In general, with things like negative values where there should only be positive ones, there is little you can do except return to your original data source and look for where the error crept in. If the value is negative in the original source as well, your options are limited.

But here is something that works incredibly well: write to the people who compiled the data. I would hesitate to advocate a solution like this, but in my experience it has always worked.

Re: outliers, I tend not to think about them in a generic way. They are always problem-specific, and the ways to deal with them differ accordingly.

I will try to cobble together something on outlier detection soon. But the next few posts will be on linking C code. :-)