Monday, March 11, 2013

Some epidemiology for a change

John Cook has an interesting point:
When you reject a data point as an outlier, you’re saying that the point is unlikely to occur again, despite the fact that you’ve already seen it. This puts you in the curious position of believing that some values you have not seen are more likely than one of the values you have in fact seen.
This is especially problematic in the case of rare but important outcomes, and it can be very hard to decide what to do in these cases.  Imagine a randomized controlled trial for the effectiveness of a new medication for a rare disease (maybe something like memory improvement in older adults).  One of the treated participants experiences sudden cardiac death whereas nobody in the placebo group does.

On the one hand, if the sudden cardiac death had occurred in the placebo group, we would be extremely reluctant to advance this as evidence that the medication in question prevents death.  On the other hand, rare but serious adverse drug events both exist and can do a great deal of damage.  The true but trivial answer is "get more data points".  Obviously, if this is a feasible option it should be pursued.
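To make the "one death versus none" situation concrete, here is a minimal sketch of a one-sided Fisher exact test written with only the Python standard library. The arm size of 100 is purely an assumption for illustration (the example above gives no sample sizes), and `fisher_one_sided` is a hypothetical helper name, not a function from any package.

```python
from math import comb


def fisher_one_sided(events_treated, n_treated, events_control, n_control):
    """One-sided Fisher exact p-value.

    Under the null of no treatment effect, this is the chance that at least
    `events_treated` of the observed events land in the treated arm.
    """
    total_events = events_treated + events_control
    n_total = n_treated + n_control
    denominator = comb(n_total, total_events)
    p_value = 0.0
    # Sum hypergeometric probabilities from the observed count upward.
    for k in range(events_treated, min(total_events, n_treated) + 1):
        p_value += comb(n_treated, k) * comb(n_control, total_events - k) / denominator
    return p_value


# Hypothetical trial: 100 participants per arm (an assumed number).
ARM_SIZE = 100
p_one_death = fisher_one_sided(1, ARM_SIZE, 0, ARM_SIZE)
p_three_deaths = fisher_one_sided(3, ARM_SIZE, 0, ARM_SIZE)
print(f"1 death vs 0:  p = {p_one_death:.3f}")
print(f"3 deaths vs 0: p = {p_three_deaths:.3f}")
```

With equal arms, a single event is equally likely to fall in either group under the null, so the one-sided p-value is exactly 0.5 whatever the arm size. The test simply has nothing to say about one event, which is why the question comes down to judgment or more data.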

But these questions get really tricky when there is simply a dearth of data.  Under these circumstances, I do not think that any statistical approach (frequentist, Bayesian or other) is going to give consistently useful answers, as we don't know if the outlier is a mistake (a recording error, for example) or if it is the most important feature of the data.

It's not a fun problem. 


  1. When I address this problem in my intro stat class, I normally tell the students that when they exclude an outlier, they are redefining the population to be one that would not include that outlier.

    For example, if you had a data set that included the 50 U.S. states plus the District of Columbia and D.C. was an outlier, you might throw it out as it was not a state. On the other hand, if you throw out Alaska, which is also often an outlier, you need to think about what your new definition of "State" is.

  2. Ralmond: This is a really good approach when the outlier belongs to an identifiable population. My favorite example is how Seattle is an outlier for hours of sunlight per year (on the low end). But excluding it as a major US city always seems really odd, even if there are principled reasons to do so.

  3. Joseph:

    Bayes is not a panacea, but I think the general idea of modeling potential "outliers" using a mixture distribution (possibly a continuous mixture such as a t) is a useful general framework that allows the incorporation of additional information where available. As statistical problems go, I think this one is as close to "solved" as is anything out there.

  4. Hi Andrew:

    Oh, I agree. It is more the combination of small numbers and outliers that I find hard to handle. You often don't know if the outlier is a data recording error or your entire effect.

    It's why I am willing to put up with the guaranteed bias in a medical claims study to look for rare adverse medication effects: the trials are often too small to let you know for sure which of the two cases you are seeing.
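The mixture idea raised in the comments above (treating potential outliers as draws from a heavy-tailed distribution such as a t) can be sketched in a few lines. This is only an illustration under stated assumptions: the data are made up, the degrees of freedom are fixed at 3 rather than estimated, `t_location` is a hypothetical helper, and the fit is a simple EM-style reweighting rather than a full Bayesian model.

```python
from statistics import mean, median, stdev


def t_location(values, dof=3.0, iterations=200):
    """Fit the center and scale of a Student-t model with fixed degrees of
    freedom, using the standard EM reweighting scheme."""
    mu = median(values)
    sigma = stdev(values)
    weights = [1.0] * len(values)
    for _ in range(iterations):
        # E-step: points far from the current center get smaller weights.
        weights = [(dof + 1.0) / (dof + ((x - mu) / sigma) ** 2) for x in values]
        # M-step: weighted mean and weighted scale.
        mu = sum(w * x for w, x in zip(weights, values)) / sum(weights)
        sigma = (sum(w * (x - mu) ** 2 for w, x in zip(weights, values)) / len(values)) ** 0.5
    return mu, sigma, weights


# Made-up measurements: five ordinary values and one suspicious one.
data = [1.0, 1.2, 0.9, 1.1, 1.0, 8.0]
t_center, t_scale, weights = t_location(data)
print(f"plain mean: {mean(data):.2f}")
print(f"t-model center: {t_center:.2f}")
print("weights:", [round(w, 3) for w in weights])
```

The suspicious value is never thrown out; it is just downweighted, so nobody has to declare it an outlier. With only six points, though, the result leans heavily on the assumed degrees of freedom, which is the small-numbers difficulty discussed above.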