(Some final thoughts on statistical significance)
The real problem with p-values isn't just that people want it to do something that it can't do; they want it to do something that no single number can ever do, fully describe the quality and reliability of an experiment or study. This simply isn't one of those mathematical beasts that can be reduced to a scalar. If you try then sooner or later you will inevitably run into a situation where you get the same metric for two tests of widely different quality.
Which leads me to the curse of large numbers. Those you who are familiar with statistics (i.e. pretty much everybody who reads this blog) might want to skip the next paragraph because this goes all the way back to stat 101.
Let's take simplest case we can. You want to show that the mean of some group is positive so you take a random sample and calculate the probability of getting the results you saw or something more extreme (the probability of getting exactly results you saw is pretty much zero) working under the assumption that the mean of the group was actually zero. This works because the bigger the samples you take the more the means of those samples will tend to follow a nice smooth bell curve and the closer those means will tend to group around the mean of the group you're sampling from.
(For any teachers out there, a good way of introducing the central limit theorem is to have students simulate coin flips with Excel then make histograms based on various sample sizes.)
You might think of sampling error as the average difference between the mean of the group you're interested in and the mean of the samples you take from it (that's not exactly what it means but it's close) . The bigger the sample the smaller you expect that error to be which makes sense. If you picked three people at random, you might get three tall people or three millionaires, but if you pick twenty people at random, the chances of getting twenty tall people or twenty millionaires is virtually are next to nothing.
The trouble is that sampling error is only one of the things a statistician has to worry about. The sampled population might not reflect the population you want to draw inferences about. Your sample might not be random. Data may not be accurately entered. There may be problems with aliasing and confounding. Independence assumptions may be violated. With respect to sample size, the biases associated with these problems are all fixed quantities. A big sample does absolutely nothing to address them.
There's an old joke about a statistician who wakes up to find his room on fire, says to himself "I need more observations" and goes back to sleep. We do spend a lot of our time pushing for more data (and, some would say, whining about not having enough), but we do that not because small sample sizes are the root of all of our problems but because they are the easiest problem to fix.
Of course "fix" as used here is an asymptotic concept and the asymptote is not zero. Even an infinite sample wouldn't result in a perfect study; you would still be left with all of the flaws and biases that are an inevitable part of all research no matter how well thought out and executed it may be.
This is a particular concern for the corporate statistician who often encounters the combination of large samples and low quality data. It's not unusual to see analyses done on tens or even hundreds of thousands of sales or customer records and more often than not, when the results are presented someone will point to the nano-scale p-value as an indication of the quality and reliability of the findings.
As far as I know, no one reviewing for a serious journal would think that p<0.001 means that we're 99.9% sure that a conclusion is true, but that's what almost everyone without an analytic background thinks.
And that is a problem.