Tuesday, March 23, 2010

More questions about the statistics of Freakonomics

Felix Salmon is on the case:

There’s a nice empirical post-script to the debate over the economic effects of classifying the Spotted Owl as an endangered species. Freakonomics cites a study putting the effect at $46 billion, but others, including John Berry, who wrote a story on the subject for the Washington Post, think it’s much closer to zero.

And now it seems the Berry side of the argument has some good Freakonomics-style panel OLS regression analysis of the microeconomy of the Pacific Northwest to back up its side of the argument. A new paper by Annabel Kirschner finds that unemployment in the region didn’t go down when the timber industry improved, and it didn’t go up when the timber industry declined — not after you adjust for much more obvious things like the presence of minorities in the area.

Comparing Apples and furniture in a box



In a previous post on branding, I used Apple as an example of a company that, because of its brand, can charge a substantial premium for its high-quality products. In this New Yorker post, James Surowiecki compares Apple to companies that take the opposite approach.

For Apple, which has enjoyed enormous success in recent years, “build it and they will pay” is business as usual. But it’s not a universal business truth. On the contrary, companies like Ikea, H. & M., and the makers of the Flip video camera are flourishing not by selling products or services that are “far better” than anyone else’s but by selling things that aren’t bad and cost a lot less. These products are much better than the cheap stuff you used to buy at Woolworth, and they tend to be appealingly styled, but, unlike Apple, the companies aren’t trying to build the best mousetrap out there. Instead, they’re engaged in what Wired recently christened the “good-enough revolution.” For them, the key to success isn’t excellence. It’s well-priced adequacy.

These two strategies may look completely different, but they have one crucial thing in common: they don’t target the amorphous blob of consumers who make up the middle of the market. Paradoxically, ignoring these people has turned out to be a great way of getting lots of customers, because, in many businesses, high- and low-end producers are taking more and more of the market. In fashion, both H. & M. and Hermès have prospered during the recession. In the auto industry, luxury-car sales, though initially hurt by the downturn, are reemerging as one of the most profitable segments of the market, even as small cars like the Ford Focus are luring consumers into showrooms. And, in the computer business, the Taiwanese company Acer has become a dominant player by making cheap, reasonably good laptops—the reverse of Apple’s premium-price approach.

Monday, March 22, 2010

True models?

The p-value discussion started by an article authored by Tom Siegfried has generated a lot of commentary. Andrew Gelman has tried to round up many of the discussion points.

But the best part of the post (besides showing the diversity out there) was hidden at the bottom. Andrew comments:

"In all the settings I've ever worked on, the probability that the model is true is . . . zero!"

Well, he is most certainly correct in pharmacoepidemiology as well. I see a lot of variation in how to handle the biases that are inherent in observational pharmacoepidemiology -- but the focus on randomized drug trials should be a major clue that these associations are tricky to model. As a point of fact, the issue of confounding by indication, channeling bias, indication bias, or whatever else you want to call it is central to the field. And the underlying idea here is that we can't get enough information about participants to model the influence of drugs being channeled to sicker patients.
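Here is a minimal simulation of that channeling problem -- purely illustrative numbers, not drawn from any real study -- in which the drug has no effect at all but, because sicker patients are steered toward it, the naive comparison makes the drug look harmful:

```python
# Minimal sketch (illustration only): channeling bias / confounding by
# indication. The drug has NO effect on the outcome, but sicker patients
# are more likely to receive it, so the crude comparison shows "harm."
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

severity = rng.normal(size=n)                          # unmeasured illness severity
p_drug = 1 / (1 + np.exp(-2 * severity))               # sicker -> more likely treated
drug = rng.binomial(1, p_drug)
p_outcome = 1 / (1 + np.exp(-(-2 + 1.5 * severity)))   # severity drives the outcome
outcome = rng.binomial(1, p_outcome)                   # note: no drug term at all

def odds(p):
    return p / (1 - p)

crude_or = odds(outcome[drug == 1].mean()) / odds(outcome[drug == 0].mean())
print(f"Crude odds ratio with no true drug effect: {crude_or:.2f}")  # well above 1.0
```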

So I wish that, in my field as well, people would realize that the relationships are tricky and no model is ever going to be absolutely correctly specified.

The curse of large numbers and the real problem with p-values

(Some final thoughts on statistical significance)

The real problem with p-values isn't just that people want them to do something they can't do; they want them to do something that no single number can ever do: fully describe the quality and reliability of an experiment or study. This simply isn't one of those mathematical beasts that can be reduced to a scalar. If you try, sooner or later you will run into a situation where you get the same metric for two tests of widely different quality.

Which leads me to the curse of large numbers. Those of you who are familiar with statistics (i.e. pretty much everybody who reads this blog) might want to skip the next paragraph, because this goes all the way back to stat 101.

Let's take the simplest case we can. You want to show that the mean of some group is positive, so you take a random sample and calculate the probability of getting the results you saw or something more extreme (the probability of getting exactly the results you saw is pretty much zero), working under the assumption that the mean of the group is actually zero. This works because the bigger the samples you take, the more closely the means of those samples will tend to follow a nice smooth bell curve and the more tightly those means will tend to group around the mean of the group you're sampling from.

(For any teachers out there, a good way of introducing the central limit theorem is to have students simulate coin flips with Excel then make histograms based on various sample sizes.)

You might think of sampling error as the average difference between the mean of the group you're interested in and the mean of the samples you take from it (that's not exactly what it means, but it's close). The bigger the sample, the smaller you expect that error to be, which makes sense. If you picked three people at random, you might get three tall people or three millionaires, but if you pick twenty people at random, the chances of getting twenty tall people or twenty millionaires are next to nothing.
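If you would rather see it than take it on faith, here is a quick Python version of the coin-flip exercise suggested above (illustrative numbers only):

```python
# Coin-flip simulation: the means of repeated samples pile up in a bell
# curve around the true mean, and the spread of those sample means (the
# sampling error) shrinks as the samples get bigger.
import numpy as np

rng = np.random.default_rng(1)
true_mean = 0.5                          # a fair coin

for n in (10, 100, 1000):
    sample_means = rng.binomial(1, true_mean, size=(10_000, n)).mean(axis=1)
    print(f"n={n:5d}  mean of sample means={sample_means.mean():.3f}  "
          f"spread (std of sample means)={sample_means.std():.4f}")
# The spread falls roughly like 1/sqrt(n): about 0.16, 0.05, and 0.016.
```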

The trouble is that sampling error is only one of the things a statistician has to worry about. The sampled population might not reflect the population you want to draw inferences about. Your sample might not be random. Data may not be accurately entered. There may be problems with aliasing and confounding. Independence assumptions may be violated. With respect to sample size, the biases associated with these problems are all fixed quantities. A big sample does absolutely nothing to address them.

There's an old joke about a statistician who wakes up to find his room on fire, says to himself "I need more observations" and goes back to sleep. We do spend a lot of our time pushing for more data (and, some would say, whining about not having enough), but we do that not because small sample sizes are the root of all of our problems but because they are the easiest problem to fix.

Of course "fix" as used here is an asymptotic concept and the asymptote is not zero. Even an infinite sample wouldn't result in a perfect study; you would still be left with all of the flaws and biases that are an inevitable part of all research no matter how well thought out and executed it may be.

This is a particular concern for the corporate statistician, who often encounters the combination of large samples and low-quality data. It's not unusual to see analyses done on tens or even hundreds of thousands of sales or customer records, and more often than not, when the results are presented, someone will point to the nano-scale p-value as an indication of the quality and reliability of the findings.
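Here is a rough sketch of what I mean, with entirely made-up data: a small, fixed bias that a larger sample does nothing to remove will still produce an arbitrarily tiny p-value once the record count gets big enough.

```python
# The "curse of large numbers": a fixed, tiny bias (here a systematic
# 0.01 shift, standing in for bad data entry or a non-random sample)
# becomes wildly "significant" once the sample is huge, even though
# nothing about the data quality has improved.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
bias = 0.01                               # fixed bias, unaffected by sample size

for n in (100, 10_000, 1_000_000):
    x = rng.normal(loc=0.0, scale=1.0, size=n) + bias
    t, p = stats.ttest_1samp(x, popmean=0.0)
    print(f"n={n:>9,}  p-value={p:.2e}")
# By the time n is in the hundreds of thousands the p-value is microscopic,
# but the study is no more reliable: the bias is exactly the same size it
# was at n=100.
```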

As far as I know, no one reviewing for a serious journal would think that p<0.001 means that we're 99.9% sure that a conclusion is true, but that's what almost everyone without an analytic background thinks.

And that is a problem.

Another chapter in the New Republic Debate

Check it out here.

Sunday, March 21, 2010

Silence

I'm in the process of moving from one corner of the United States to the other. Blogging may be extremely light for the next 2 weeks.

Apologies in advance!

Interesting variable taxation idea from Thoma

From Economist's View:
Political battles make it very difficult to use discretionary fiscal policy to fight a recession, so more automatic stabilizers are needed. Along those lines, if something like this were to be implemented to stabilize the economy over the business cycle, I'd prefer to do this more generally, i.e. allow income taxes, payroll taxes, etc. to vary procyclically. That is, these taxes would be lower in bad times and higher when things improve, and implemented through an automatic moving average type of rule that produces the same revenue as some target constant tax rate (e.g. existing rates).
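Just to make the mechanics concrete, here is one toy reading of an "automatic moving average type of rule" -- my own illustration with made-up numbers, not Thoma's actual proposal:

```python
# Hypothetical sketch: scale the tax rate by the ratio of current income
# to a trailing moving average, so the rate drops below the target in
# downturns and rises above it in recoveries, while revenue over the
# cycle stays close to what the constant target rate would raise.
import numpy as np

target_rate = 0.20          # the constant rate the rule should roughly match
window = 5                  # years in the trailing moving average

income = np.array([100, 103, 106, 98, 95, 101, 107, 110, 104, 108], dtype=float)

rates = np.array([
    target_rate * income[t] / income[max(0, t - window + 1): t + 1].mean()
    for t in range(len(income))
])

revenue_rule = (rates * income).sum()
revenue_flat = target_rate * income.sum()
print("procyclical rates:", np.round(rates, 3))
print(f"revenue under the rule: {revenue_rule:.1f}  vs. flat rate: {revenue_flat:.1f}")
# The two revenue totals are close but not identical; exact revenue
# neutrality over the cycle would need a small correction term.
```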

Saturday, March 20, 2010

New Proposed National Math Standards

These actually look pretty good.

Friday, March 19, 2010

Too late for an actual post, but...

There are another couple of entries in the TNR education debate. If you're an early riser you can read them before I do.

Thursday, March 18, 2010

Some more thoughts on p-value

One of the advantages of being a corporate statistician was that generally you not only ran the test; you also explained the statistics. I could tell the department head or VP that a p-value of 0.08 wasn't bad for a preliminary study with a small sample, or that a p-value of 0.04 wasn't that impressive with a controlled study of a thousand customers. I could factor in things like implementation costs and potential returns when looking at type-I and type-II errors. For low implementation/high returns, I might set significance at 0.1. If the situation were reversed, I might set it at 0.01.
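To make that trade-off concrete, here is a toy expected-value calculation of the sort I have in mind -- not a formula I actually used, and every cost, payoff, and prior below is invented for illustration:

```python
# Toy calculation: expected value of the rule "implement the program if
# the one-sided p-value is below alpha," weighing the cost of acting on
# a false positive against the return forgone by missing a real effect.
from scipy import stats

def expected_value(alpha, n=1000, effect=0.1, sd=1.0,
                   p_real=0.3, implementation_cost=50_000, payoff=400_000):
    """Expected value of 'implement if one-sided p < alpha' (made-up inputs)."""
    z_crit = stats.norm.ppf(1 - alpha)
    power = 1 - stats.norm.cdf(z_crit - effect / (sd / n ** 0.5))
    gain_if_real = power * (payoff - implementation_cost)
    loss_if_null = alpha * implementation_cost
    return p_real * gain_if_real - (1 - p_real) * loss_if_null

for alpha in (0.01, 0.05, 0.10):
    print(f"alpha={alpha:.2f}  expected value = {expected_value(alpha):,.0f}")
# Cheap implementation plus a big potential payoff pushes the best cutoff
# up toward 0.10; an expensive implementation pushes it down toward 0.01.
```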

Obviously, we can't let everyone set their own rules, but (to coin a phrase) I wonder if, in an effort to make things as simple as possible, we haven't actually made them simpler. Statistical significance is an arbitrary, context-sensitive cut-off that we assign before a test based on the relative costs of a false positive and a false negative. It is not a God-given value of 5%.

Letting everyone pick their own definition of significance is a bad idea, but so is completely ignoring context. Does it make any sense to demand the same level of p-value from a study of a rare, slow-growing cancer (where five years is quick and a sample size of 20 is an achievement) and a drug to reduce BP in the moderately obese (where a course of treatment lasts two weeks and the streets are filled with potential test subjects)? Should we ignore a promising preliminary study because it comes in at 0.06?

For a real-life example, consider the public reaction to the recent statement that we didn't have statistically significant data that the earth had warmed over the past 15 years. This was a small sample and I'm under the impression that the results would have been significant at the 0.1 level, but these points were lost (or discarded) in most of the coverage.

We need to do a better job dealing with these grays. We might try replacing the phrase "statistically significant" with "statistically significant at 10/5/1/0.1%." Or we might look at some sort of a two-tiered system, raising significance to 0.01 for most studies while making room for "provisionally significant" papers where research is badly needed, adequate samples are not available, or the costs of a type-II error are deemed unusually high.

I'm not sure how practical or effective these steps might be but I am sure we can do better. Statisticians know how to deal with gray areas; now we need to work on how we explain them.

For more on the subject, check out Joseph's posts here and here.

The winner's curse

I have heard about the article that Mark references in a previous post; it's hard to be in the epidemiology field and not have heard about it. But, for this post, I want to focus on a single aspect of the problem.

Let's say that you have a rare side effect that requires a large database to find and, even then, the power is limited. Let's say, for an illustration, that the true effect of a drug on an outcome is an Odds Ratio (or Relative Risk; it's a rare disease) of 1.50. If, by chance alone, the estimate in database A is 1.45 (95% Confidence Interval: 0.99 to 1.98) and the estimate in database B is 1.55 (95% CI: 1.03 to 2.08), then what would be the result of two studies on this side effect?

Well, if database A is done first, then maybe nobody ever looks at database B (these databases are often expensive to use and time-consuming to analyze). If database B is used first, the second estimate will be from database A (and thus lower). In fact, there is some chance that the researchers from database A will never publish (as it has historically been the case that null results are hard to publish).

The result? Estimates of the association between the drug and the outcome will tend to be biased upwards -- because the initial published finding (null results being hard to publish) will tend to be an over-estimate of the true causal effect.
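A quick simulation makes the point (illustrative numbers only): take a true odds ratio of 1.5, generate the kind of noisy estimates the databases above would produce, and "publish" only the ones whose confidence interval excludes the null.

```python
# Winner's curse / publication bias sketch: the true odds ratio is 1.5,
# each "database" yields a noisy estimate, and only significant findings
# (95% CI excluding OR=1) get published. The published estimates average
# well above the truth.
import numpy as np

rng = np.random.default_rng(3)
true_log_or = np.log(1.5)
se = 0.20                       # assumed standard error for a rare outcome

log_or_hats = rng.normal(true_log_or, se, size=100_000)   # one per "database"
significant = (log_or_hats - 1.96 * se) > 0               # lower CI bound above OR=1

print(f"mean OR, all studies:        {np.exp(log_or_hats).mean():.2f}")
print(f"mean OR, 'published' subset: {np.exp(log_or_hats[significant]).mean():.2f}")
# With se=0.20 roughly half the studies come out significant, and that
# published half overstates the true OR of 1.5.
```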

These factors make it hard to determine if a meta-analysis of observational evidence would give an asymptotically unbiased estimate of the "truth" (likely it would be biased upwards).

In that sense, on average, published results are biased to some extent.

A lot to discuss

When you get past the inflammatory opening, this article in Science News is something you should take a look at (via Felix Salmon).
“There is increasing concern,” declared epidemiologist John Ioannidis in a highly cited 2005 paper in PLoS Medicine, “that in modern research, false findings may be the majority or even the vast majority of published research claims.”

Ioannidis claimed to prove that more than half of published findings are false, but his analysis came under fire for statistical shortcomings of its own. “It may be true, but he didn’t prove it,” says biostatistician Steven Goodman of the Johns Hopkins University School of Public Health. On the other hand, says Goodman, the basic message stands. “There are more false claims made in the medical literature than anybody appreciates,” he says. “There’s no question about that.”

Nobody contends that all of science is wrong, or that it hasn’t compiled an impressive array of truths about the natural world. Still, any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical. “A lot of scientists don’t understand statistics,” says Goodman. “And they don’t understand statistics because the statistics don’t make sense.”
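The arithmetic behind that kind of claim is simple enough to sketch; the inputs below are made up for illustration, but the logic is just the false-positive share among "significant" findings.

```python
# Ioannidis-style arithmetic with invented inputs: if only a minority of
# tested hypotheses are true, a large share of "significant" findings are
# false positives even before any bias enters the picture.
def share_of_findings_that_are_false(prior=0.1, alpha=0.05, power=0.5):
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return false_positives / (true_positives + false_positives)

print(f"{share_of_findings_that_are_false():.0%} of 'significant' results are false")
# With a 10% prior, alpha=0.05, and 50% power, that works out to 47% --
# add publication bias and flexible analysis choices and it climbs further.
```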

Wednesday, March 17, 2010

Evidence

I was reading Andrew Gelman (always a source of interesting statistical thoughts) and I started thinking about p-values in epidemiology.

Is there a measure in all of medical research more controversial than the p-value? Sometimes I really don't think so. In a lot of ways, it seems to dominate research just because it has become an informal standard. But it felt odd, the one time I did it, to say in a paper that there was no association (p=.0508) when adding a few more cases might have flipped the answer.
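For a sense of how fragile that kind of borderline result is, here is a hypothetical 2x2 table (not the actual study) where a handful of extra cases flips the verdict:

```python
# Hypothetical 2x2 tables (rows: exposed/unexposed; columns: cases,
# non-cases) showing a p-value dragged from just above 0.05 to well
# below it by a few additional cases.
from scipy import stats

before = [[20, 180],   # exposed:   20 cases, 180 non-cases
          [10, 190]]   # unexposed: 10 cases, 190 non-cases
after  = [[23, 177],   # three more exposed cases turn up
          [10, 190]]

_, p_before, _, _ = stats.chi2_contingency(before, correction=False)
_, p_after, _, _ = stats.chi2_contingency(after, correction=False)
print(f"p before: {p_before:.3f}   p after: {p_after:.3f}")  # ~0.058 vs ~0.018
```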

I don't think confidence intervals, used in the sense of "does this interval include the null," really advance the issue either. But it's true that we do want a simple way to decide if we should be concerned about a possible adverse association, and the medical literature is not well constructed for a complex back-and-forth discussion about statistical models.

I'm also not convinced that any "standard of evidence" would not be similarly misapplied. Any approach that is primarily used by trained statisticians (sensitive to its limitations) will look good compared with a broad standard that is also applied by non-specialists.

So I guess I don't see an easy way to replace our reliance on p-values in the medical literature, but it is worth some thought.

"We could call them 'universities'"

This bit from the from Kevin Carey's entry into the New Republic Debate caught my eye:

In the end, [Diane Ravitch's] Death and Life is painfully short on non-curricular ideas that might actually improve education for those who need it most. The last few pages contain nothing but generalities: ... "Teachers must be well educated and know their subjects." That's all on page 238. The complete lack of engagement with how to do these things is striking.

If only there were a system of institutions where teachers could go for instruction in their fields. If there were such a system then Dr. Ravitch could say "Teachers must be well educated and know their subjects" and all reasonable people would assume that she meant we should require teachers to take more advanced courses and provide additional compensation for those who exceeded those requirements.

Tuesday, March 16, 2010

Some context on schools and the magic of the markets

One reason emotions run so hot in the current debate is that the always heated controversies of education have somehow become intertwined with sensitive points of economic philosophy. The discussion over child welfare and opportunity has been rewritten as an epic struggle between big government and unions on one hand and markets and entrepreneurs on the other. (insert Lord of the Rings reference here)

When Ben Wildavsky said "Perhaps most striking to me as I read Death and Life was Ravitch’s odd aversion to, even contempt for, market economics and business as they relate to education" he wasn't wasting his time on a minor aspect of the book; he was focusing on the fundamental principle of the debate.

The success or even the applicability of business metrics and mission statements in education is a topic for another post, but the subject does remind me of a presentation the head of the education department gave when I was getting my certification in the late Eighties. He showed us a video of Tom Peters discussing In Search of Excellence, then spent about an hour extolling Peters's ideas.

(on a related note, I don't recall any of my education classes mentioning George Polya)

I can't say exactly when, but by 1987 business-based approaches were the big thing in education and had been for quite a while, a movement that led to the introduction of charter schools at the end of the decade. And the movement has continued to this day.

In other words, American schools have been trying a free market/business school approach for between twenty-five and thirty years.

I'm not going to say anything here about the success or failure of those efforts, but it is worth putting in context.