Or, more accurately, a complement.
I think we can all agree that the quality of a study cannot be described by a scalar. Given that, is it possible to come up with a fairly small set of standard metrics that would give us a better picture?
The p-value should be included, and the set shouldn't have more than four metrics. Other than that, does anyone have any suggestions?
Comments, observations and thoughts from two bloggers on applied statistics, higher education and epidemiology. Joseph is an associate professor. Mark is a professional statistician and former math teacher.
Showing posts with label p-value.
Tuesday, April 13, 2010
The ongoing tyranny of statistical significance testing in biomedical research
This is the title of an article that appears in the European Journal of Epidemiology. I think that it is a good contribution to the recent discussions on p-values. The authors seem to be focusing on the difference between clinical significance and statistical significance (and pointing out the many cases where the two may diverge).
Usually this happens with strong effects and small sample sizes (where real associations are hard to establish) or with weak effects and very large sample sizes (where unimportant differences can be demonstrated).
I was surprised that the authors did not consider the possibility of observational meta-analysis, which seems to me to be one possible way to handle the issue of small sample sizes in any single study. But I think the article is in concordance with the larger message that these tests are not a substitute for personal judgement. In the context of designed experiments they may be fine (since the experiment can be created to fit this criterion), but the uncritical use of p-values will also be an issue in the interpretation of observational research.
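To make that divergence concrete, here is a quick toy illustration in Python. The effect sizes and sample sizes below are mine, not the article's; they are just chosen to show the two situations side by side.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Strong effect (0.8 SD) but only 12 subjects per arm: the t-test often misses it.
small_treated = rng.normal(0.8, 1.0, size=12)
small_control = rng.normal(0.0, 1.0, size=12)
print("strong effect, n=12 per arm: p =",
      stats.ttest_ind(small_treated, small_control).pvalue)

# Trivial effect (0.02 SD) but 200,000 subjects per arm: the p-value is tiny
# even though the difference is of no practical importance.
big_treated = rng.normal(0.02, 1.0, size=200_000)
big_control = rng.normal(0.0, 1.0, size=200_000)
print("trivial effect, n=200,000 per arm: p =",
      stats.ttest_ind(big_treated, big_control).pvalue)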
Monday, March 22, 2010
The curse of large numbers and the real problem with p-values
(Some final thoughts on statistical significance)
The real problem with p-values isn't just that people want them to do something they can't do; people want them to do something that no single number can ever do: fully describe the quality and reliability of an experiment or study. This simply isn't one of those mathematical beasts that can be reduced to a scalar. If you try, sooner or later you will inevitably run into a situation where you get the same metric for two tests of widely different quality.
Which leads me to the curse of large numbers. Those of you who are familiar with statistics (i.e. pretty much everybody who reads this blog) might want to skip the next paragraph, because this goes all the way back to Stat 101.
Let's take the simplest case we can. You want to show that the mean of some group is positive, so you take a random sample and calculate the probability of getting the results you saw or something more extreme (the probability of getting exactly the results you saw is pretty much zero), working under the assumption that the mean of the group was actually zero. This works because the bigger the samples you take, the more the means of those samples will tend to follow a nice smooth bell curve, and the closer those means will tend to group around the mean of the group you're sampling from.
(For any teachers out there, a good way of introducing the central limit theorem is to have students simulate coin flips with Excel then make histograms based on various sample sizes.)
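For the non-Excel crowd, here's roughly the same exercise sketched in Python; the sample sizes and the seed are arbitrary.

import numpy as np

rng = np.random.default_rng(42)

for n in (5, 30, 200):                               # flips per sample
    flips = rng.integers(0, 2, size=(10_000, n))     # 10,000 samples of n fair-coin flips
    means = flips.mean(axis=1)                       # one sample mean per sample
    print(f"n={n:>3}: mean of sample means = {means.mean():.3f}, "
          f"SD of sample means = {means.std():.3f}")
    # A histogram of `means` looks more and more like a bell curve centered
    # on 0.5, and it tightens up, as n grows.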
You might think of sampling error as the average difference between the mean of the group you're interested in and the mean of the samples you take from it (that's not exactly what it means, but it's close). The bigger the sample, the smaller you expect that error to be, which makes sense. If you picked three people at random, you might get three tall people or three millionaires, but if you pick twenty people at random, the chances of getting twenty tall people or twenty millionaires are next to nothing.
The trouble is that sampling error is only one of the things a statistician has to worry about. The sampled population might not reflect the population you want to draw inferences about. Your sample might not be random. Data may not be accurately entered. There may be problems with aliasing and confounding. Independence assumptions may be violated. Unlike sampling error, the biases associated with these problems are fixed with respect to sample size. A big sample does absolutely nothing to address them.
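To see what that means in practice, here's a small hypothetical simulation. The 0.1 shift in the sampling frame is invented, but the behavior is completely general.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, frame_bias = 0.0, 0.1      # the frame we can sample from is shifted by 0.1

for n in (100, 10_000, 1_000_000):
    sample = rng.normal(true_mean + frame_bias, 1.0, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)             # sampling error: shrinks like 1/sqrt(n)
    t = sample.mean() / se                           # test of H0: true mean is 0
    p = 2 * stats.t.sf(abs(t), df=n - 1)
    print(f"n={n:>9,}: estimate = {sample.mean():+.3f}, SE = {se:.4f}, p = {p:.1e}")

# The standard error collapses and the p-value heads to zero, but the estimate
# settles on 0.1, not on the true value of 0. More data never touches the bias.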
There's an old joke about a statistician who wakes up to find his room on fire, says to himself "I need more observations" and goes back to sleep. We do spend a lot of our time pushing for more data (and, some would say, whining about not having enough), but we do that not because small sample sizes are the root of all of our problems but because they are the easiest problem to fix.
Of course "fix" as used here is an asymptotic concept and the asymptote is not zero. Even an infinite sample wouldn't result in a perfect study; you would still be left with all of the flaws and biases that are an inevitable part of all research no matter how well thought out and executed it may be.
This is a particular concern for the corporate statistician, who often encounters the combination of large samples and low-quality data. It's not unusual to see analyses done on tens or even hundreds of thousands of sales or customer records, and more often than not, when the results are presented, someone will point to the nano-scale p-value as an indication of the quality and reliability of the findings.
As far as I know, no one reviewing for a serious journal would think that p<0.001 means that we're 99.9% sure that a conclusion is true, but that's what almost everyone without an analytic background thinks.
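To see just how far off that reading can be, here's a toy calculation; every number in it is invented for illustration.

tests = 100_000                       # hypothetical tests run across a field
real = tests // 100                   # suppose only 1 in 100 involves a real effect
null = tests - real
power, alpha = 0.80, 0.001            # decent power, strict cutoff

true_hits = power * real              # 800 real effects that clear p < 0.001
false_hits = alpha * null             # 99 nulls that clear it by luck
print("share of 'significant' results that are real:",
      round(true_hits / (true_hits + false_hits), 3))   # about 0.89, not 0.999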
And that is a problem.
Thursday, March 18, 2010
Some more thoughts on p-value
One of the advantages of being a corporate statistician was that generally you not only ran the test; you also explained the statistics. I could tell the department head or VP that a p-value of 0.08 wasn't bad for a preliminary study with a small sample, or that a p-value of 0.04 wasn't that impressive for a controlled study of a thousand customers. I could factor in things like implementation costs and potential returns when looking at type-I and type-II errors. For a cheap change with a high potential return, I might set significance at 0.1. If the situation were reversed, I might set it at 0.01.
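For what it's worth, here's a back-of-the-envelope sketch of that kind of reasoning. All of the costs, effect sizes, and probabilities below are made up for illustration, not taken from any real engagement.

from scipy.stats import norm

def expected_cost(alpha, n, effect_sd, p_real, cost_false_pos, cost_false_neg):
    """Expected cost of acting on a one-sided z-test at significance level alpha."""
    z_alpha = norm.ppf(1 - alpha)
    power = 1 - norm.cdf(z_alpha - effect_sd * n ** 0.5)    # power against the assumed effect
    false_pos = (1 - p_real) * alpha * cost_false_pos       # roll out something that does nothing
    false_neg = p_real * (1 - power) * cost_false_neg       # shelve something that works
    return false_pos + false_neg

# Cheap to implement, big upside if it works: the looser 0.10 cutoff has the
# lower expected cost here, because at 0.01 the missed opportunities dominate.
for alpha in (0.10, 0.01):
    print(alpha, expected_cost(alpha, n=200, effect_sd=0.15, p_real=0.3,
                               cost_false_pos=10_000, cost_false_neg=500_000))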
Obviously, we can't let everyone set their own rules, but (to coin a phrase) I wonder if in an effort to make things as simple as possible, we haven't actually made them simpler. Statistical significance is an arbitrary, context-sensitive cut-off that we assign before a test based on the relative costs of a false positive and a false negative. It is not a God-given value of 5%.
Letting everyone pick their own definition of significance is a bad idea, but so is completely ignoring context. Does it make any sense to demand the same p-value threshold from a study of a rare, slow-growing cancer (where five years is quick and a sample size of 20 is an achievement) and from a trial of a drug to reduce blood pressure in the moderately obese (where a course of treatment lasts two weeks and the streets are filled with potential test subjects)? Should we ignore a promising preliminary study because it comes in at 0.06?
For a real-life example, consider the public reaction to the recent statement that we didn't have statistically significant evidence that the earth had warmed over the past 15 years. This was a small sample, and I'm under the impression that the results would have been significant at the 0.1 level, but these points were lost (or discarded) in most of the coverage.
We need to do a better job dealing with these shades of gray. We might try replacing the phrase "statistically significant" with "statistically significant at the 10/5/1/0.1% level." Or we might look at some sort of two-tiered system, tightening the threshold to 0.01 for most studies while making room for "provisionally significant" papers where research is badly needed, adequate samples are not available, or the costs of a type-II error are deemed unusually high.
I'm not sure how practical or effective these steps might be but I am sure we can do better. Statisticians know how to deal with gray areas; now we need to work on how we explain them.
For more on the subject, check out Joseph's posts here and here.