Wednesday, August 11, 2010


John Cook has another insightful post today on data distributions. It is an area where I know I could stand to develop a fuller intuition.

One point that I think comes out of his post is that no real data ever fully fits a theoretical distribution. Ever. After all, all real data has noise in it. While this seems like an obvious point, it has a practical consequence: it is possible to get reasonable patterns of residuals when fitting the same data with different distributions. That could mean the data follows either distribution, or possibly neither.
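To get a feel for how two different distributions can describe the same data about equally well, here is a minimal sketch that compares a standard normal with a logistic distribution scaled to the same variance. The choice of the logistic as the second distribution and the grid of evaluation points are my own assumptions for illustration; they do not come from John Cook's post.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(0.0, 1.0)

# A logistic distribution with scale s has variance s^2 * pi^2 / 3,
# so this scale gives it variance 1, matching the standard normal.
LOGISTIC_SCALE = sqrt(3.0) / pi


def logistic_cdf(x):
    """CDF of a logistic distribution centred at 0 with variance 1."""
    return 1.0 / (1.0 + exp(-x / LOGISTIC_SCALE))


def max_cdf_gap(lo=-6.0, hi=6.0, steps=1201):
    """Largest absolute difference between the two CDFs on an even grid."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(STANDARD_NORMAL.cdf(x) - logistic_cdf(x)) for x in xs)


gap = max_cdf_gap()
print(f"largest CDF difference on the grid: {gap:.4f}")
```

The gap stays small across the whole range, so with a modest, noisy sample either curve could produce reasonable-looking residuals.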

Even worse, real data is sometimes a mixture of distributions based on a latent (or non-latent) variable. Andrew Gelman has an example with height: male and female adult heights are each approximately normally distributed, but the combination of the two is not (there is a nice picture on page 14 of "Data Analysis Using Regression and Multilevel/Hierarchical Models").
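The height example can be sketched numerically. The means, standard deviations, and 50/50 split below are made-up illustrative values in centimetres, not the figures from the book. The sketch computes the moments of the two-component mixture and compares it with a single normal that has the same mean and variance.

```python
from math import sqrt
from statistics import NormalDist

# Illustrative parameters only (assumed, not taken from the book), in cm.
WOMEN = NormalDist(mu=162.0, sigma=7.0)
MEN = NormalDist(mu=176.0, sigma=7.0)
COMPONENTS = [(0.5, WOMEN), (0.5, MEN)]


def mixture_pdf(x):
    """Density of the combined population at height x."""
    return sum(w * c.pdf(x) for w, c in COMPONENTS)


def mixture_moments():
    """Mean, variance, and excess kurtosis of the mixture."""
    mean = sum(w * c.mean for w, c in COMPONENTS)
    var = sum(w * (c.variance + (c.mean - mean) ** 2) for w, c in COMPONENTS)
    # For a normal component with offset d = mu - mean and variance s2,
    # E[(X - mean)^4] = d^4 + 6 * d^2 * s2 + 3 * s2^2.
    m4 = sum(
        w * ((c.mean - mean) ** 4
             + 6 * (c.mean - mean) ** 2 * c.variance
             + 3 * c.variance ** 2)
        for w, c in COMPONENTS
    )
    return mean, var, m4 / var ** 2 - 3.0


mean, var, excess_kurtosis = mixture_moments()
matched_normal = NormalDist(mean, sqrt(var))

print(f"excess kurtosis of the mixture: {excess_kurtosis:.3f}")
print(f"density at the mean, mixture:   {mixture_pdf(mean):.4f}")
print(f"density at the mean, normal:    {matched_normal.pdf(mean):.4f}")
```

A normal distribution has excess kurtosis of exactly zero, so a non-zero value here means the combined population is not normal even though each group is. With these particular numbers the mixture is flatter in the middle than the matched normal.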

Worse still, there may be factors (e.g. genetic) that result in different mean adult heights. So you can get "fat tails". This is not a small problem, as it means that your models will wildly underestimate the probability of an extreme result. I was not a huge fan of The Black Swan, but this point was correct (and, to be fair, it was the central theme of the book).
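Here is a rough sketch of how badly a single normal can underestimate an extreme result. It assumes a hypothetical population in which a small subgroup has a higher mean height; the weights, means, standard deviations, and threshold are all invented for illustration. It compares the true probability of exceeding the threshold with the probability from one normal fitted to the overall mean and variance.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical population (illustrative values in cm): most people come
# from the first component, a small subgroup has a higher mean.
COMPONENTS = [
    (0.95, NormalDist(mu=170.0, sigma=7.0)),
    (0.05, NormalDist(mu=191.0, sigma=7.0)),
]


def mixture_tail(threshold):
    """True probability of a height above the threshold."""
    return sum(w * (1.0 - c.cdf(threshold)) for w, c in COMPONENTS)


def fitted_normal():
    """Single normal with the same mean and variance as the mixture."""
    mean = sum(w * c.mean for w, c in COMPONENTS)
    var = sum(w * (c.variance + (c.mean - mean) ** 2) for w, c in COMPONENTS)
    return NormalDist(mean, sqrt(var))


threshold = 205.0
true_p = mixture_tail(threshold)
normal_p = 1.0 - fitted_normal().cdf(threshold)

print(f"P(height > {threshold}) under the mixture:       {true_p:.2e}")
print(f"P(height > {threshold}) under the fitted normal: {normal_p:.2e}")
print(f"underestimate factor: {true_p / normal_p:.1f}")
```

With these made-up numbers the fitted normal understates the tail probability by more than an order of magnitude, even though it matches the mean and variance exactly.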

All of which is to say that I am definitely going to have to think more about this issue and, hopefully, spot cases where I am not making the correct assumptions.

1 comment:

  1. That's one of the reasons I like to use non-parametric statistics when I can.