Wednesday, July 31, 2013

General versus particular cases

Andrew Gelman wrote a very interesting article in Slate on how over-reliance on statistical significance can lead to spurious findings.  The authors of the study that he was critiquing replied to his piece.  Andrew's thoughts on the response are here.

This led to two thoughts.  One, I am completely unimpressed by appeals to a paper being in a peer-reviewed journal: peer review is a screen, but even good tests have false positives.  All that publication convinces me of is that the authors were thoughtful in developing the article, not that they are immune to problems.  But this is true of all papers, including mine.

Two, I think that this is a very tough area to take a single example from.  The reason is that any one paper could well have followed the highest possible level of rigor, as Jessica Tracy and Alec Beall claim they have done.  That doesn't necessarily mean that all studies in the class have followed these practices or that there were not filters that aided or impeded publication that might enhance the risk of a false positive.

For example, I have just finished publishing a paper where I had an unexpected finding that I wanted to replicate (that there was an association was hypothesized a priori, but the direction was the reverse of the a priori hypothesis).  I found a suitable second study, added additional authors, added additional analysis, rewrote the paper to be a careful combination of two different cohorts, and redid the discussion.  Guess what, the finding did not replicate.  So then I had the special gift of publishing a null paper with a lot of authors and some potentially confusing associations.  If I had just given up at that point, the question might have been hanging around until somebody else found the same thing (I often use widely available data in my research) and published it.

So I would be cautious about multiplying the p-values together to get a probability of a false positive.  Jessica Tracy and Alec Beall write:
The chance of obtaining the same significant effect across two independent consecutive studies is .0025 (Murayama, K., Pekrun, R., & Fiedler, K. (in press). Research practices that can prevent an inflation of false-positive rates. Personality and Social Psychology Review.)
I suspect that this would only hold if the testable hypothesis was clearly stated before either study was done.  It also presumes independence (it is not always obvious that this will hold as design elements of studies may influence each other) and that there isn't a confounding factor involved (that is causing both the exposure and the outcome).
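To make the dependence on a pre-stated hypothesis concrete, here is a minimal sketch of the arithmetic. It assumes both studies are run under a true null and are independent, and it introduces a hypothetical knob, analyses_per_study, for the number of independent analyses each study could have tried before reporting the one that reached p < .05. Neither that knob nor the value of five comes from the paper under discussion; they are illustrative only.

```python
def chance_both_significant(alpha=0.05, analyses_per_study=1):
    """Chance that two independent null studies each report a 'significant' result.

    analyses_per_study is a hypothetical illustration: the number of
    independent analyses a study could have run, counting the study as
    significant if any one of them reached p < alpha.
    """
    # Chance that at least one analysis in a single null study is significant.
    chance_one_study = 1.0 - (1.0 - alpha) ** analyses_per_study
    # Independence between the two studies lets us square that chance.
    return chance_one_study ** 2


prespecified = chance_both_significant()
flexible = chance_both_significant(analyses_per_study=5)
print(f"one pre-specified test per study:  {prespecified:.4f}")
print(f"five candidate analyses per study: {flexible:.4f}")
```

With one pre-specified test per study the result is the quoted .0025.  With five candidate analyses per study it rises to roughly .05, which is why the figure only holds if the hypothesis was fixed in advance.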

Furthermore, I think as epidemiologists we need to make a decision about whether these studies are making strong causal claims or advancing a prospective association that may lead to a better understanding of a disease state.  We often write articles speaking in the latter mode but then lapse into the former when being quoted.

So I guess I am writing a lot to say a couple of things in conclusion. 

One, it is very hard to pick a specific example of a general problem when it is possible that any one example might happen to meet the standards required for the depth of inference being made.  This is very hard to ascertain within the standards of the literature. 

Two, the decisions about what to study and what to publish are also pretty important steps in the process.  These things can have a powerful influence on the direction of science in a very hard to detect manner.

So I want to thank Andrew Gelman for starting this conversation and the authors of the paper in question for acting as an example in this tough dialogue. 

1 comment:

  1. Multiplying p-values as Tracy and Beall suggest does not get you another p-value. In fact, it is unclear what it gets you besides a product of p-values. The easy way to see that is to take a study, divide the data in half at random, so that we have, in effect, two independent studies, and then multiply their p-values together.

    P-values measure results that did not happen as well as results that did. Their product does not have the same meaning as the product of the probabilities of two independent results that did happen. If it has any meaning at all.
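The comment's point can be checked with a short sketch. It assumes two independent studies under a true null with continuous test statistics, so that each p-value is uniform on (0, 1). Under that assumption the chance that the product of the two p-values is at most c is c(1 - ln c), not c. The function names below are made up for this illustration.

```python
import math
import random


def prob_product_at_most(c):
    """P(p1 * p2 <= c) for independent Uniform(0, 1) p-values, with 0 < c <= 1."""
    return c * (1.0 - math.log(c))


def simulate_product_at_most(c, draws=200_000, seed=0):
    """Monte Carlo check of the same probability using simulated null p-values."""
    rng = random.Random(seed)
    hits = sum(rng.random() * rng.random() <= c for _ in range(draws))
    return hits / draws


product = 0.05 * 0.05  # the product of two p-values of .05
print(f"product of the p-values:        {product:.4f}")
print(f"chance of a product this small: {prob_product_at_most(product):.4f}")
print(f"simulated chance:               {simulate_product_at_most(product):.4f}")
```

Under the null, a product as small as .0025 turns up about 1.7% of the time, roughly seven times more often than the product itself suggests.  For two studies this expression matches the combined p-value from Fisher's method, which is one standard way to combine independent p-values; it is shown here as a point of comparison, not as anything the paper's authors or the commenter used.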