Wednesday, March 24, 2010

Fighting words from Andrew Gelman

Or at least a fighting summary of someone else's...

[I've got a meeting coming up so this will have to be quick and ugly and leave lots of plot threads dangling for the sequel]

From Andrew's reaction to Triumph of the Thriller by Patrick Anderson:

Anderson doesn't really offer any systematic thoughts on all this, beyond suggesting that a higher quality of talent goes into thriller writing than before. He writes that, 50 or 70 years ago, if you were an ambitious young writer, you might want to write like Hemingway or Fitzgerald or Salinger (if you sought literary greatness with the possibility of bestsellerdom too) or like James Michener, or Herman Wouk (if you sought fame and fortune with the possibility of some depth as well) or like Harold Robbins or Irving Wallace (if you wanted to make a business out of your writing). But the topselling authors of mysteries were really another world entirely--even though their books were ubiquitous in drugstore and bus-station bookracks, and even occasionally made their way onto the bestseller lists, they barely overlapped with serious fiction, or with bestselling commercial fiction.

Nowadays, though, a young writer seeking fame and fortune (or, at least, a level of financial security allowing him to write and publish what he wants) might be drawn to the thriller, Anderson argues, for its literary as well as commercial potential. At the very least, why aim to be a modern-day Robbins or Michener if instead you can follow the footsteps of Scott Turow. And not just as a crime novelist, but as a writer of series: "Today, a young novelist with my [Anderson's] journalistic knack for action and dialogue would be drawn to a crime series; if not, his publisher would push him in that direction."

1. I'd argue (and I think most literary historians would back me up) that in terms of literary quality, crime fiction was at its best from about the time Hammett started writing for Black Mask to either the Fifties or Sixties, a period that featured: Chandler; Ross Macdonald and John D. MacDonald; Jim Thompson; Ed McBain; Donald Westlake; Joe Gores; Lawrence Block;* and a slew of worthies currently being reprinted by Hard Case.

2. Crime writing was fairly respected at the time. Check out contemporary reviews (particularly by Dorothy Parker). It was even possible for Marquand to win a Pulitzer for a "serious" novel while writing the Mr. Moto books.

3. There is an economic explanation for both the drop in quality and the surge in sales, but that will have to wait. I have a meeting at one of the studios and I need to go buy a pair of sunglasses.


*Those last three did their best work more recently but they were a product of the pulps.

p.s. Here's an illustrative passage from the NYT on the literary respect a mystery writer might achieve back before thrillers were the dominant genre:

Ross Macdonald's appeal and importance extended beyond the mystery field. He was seen as an important California author, a novelist who evoked his region as tellingly as such mainstream writers as Nathanael West and Joan Didion. Before he died, Macdonald was given the Los Angeles Times's Robert Kirsch Award for a distinguished body of work about the West. Some critics ranked him among the best American novelists of his generation.

By any standard he was remarkable. His first books, patterned on Hammett and Chandler, were at once vivid chronicles of a postwar California and elaborate retellings of Greek and other classic myths. Gradually he swapped the hard-boiled trappings for more subjective themes: personal identity, the family secret, the family scapegoat, the childhood trauma; how men and women need and battle each other, how the buried past rises like a skeleton to confront the present. He brought the tragic drama of Freud and the psychology of Sophocles to detective stories, and his prose flashed with poetic imagery. By the time of his commercial breakthrough, some of Macdonald's concerns (the breakdown between generations, the fragility of moral and global ecologies) held special resonance for a country divided by an unpopular war and alarmed for the environment. His vision was strong enough to spill into real life, where a news story or a friend's revelation could prompt the comment "Just like a Ross Macdonald novel."

It was a vision with meaning for all sorts of readers. Macdonald got fan mail from soldiers, professors, teenagers, movie directors, ministers, housewives, poets. He was claimed as a colleague by good writers around the world, including Eudora Welty, Andrey Voznesensky, Elizabeth Bowen, Thomas Berger, Marshall McLuhan, Margaret Laurence, Osvaldo Soriano, Hugh Kenner, Nelson Algren, Donald Davie, and Reynolds Price.

Assumptions

We always talk about how hard it is to verify the assumptions required for missing data techniques to yield unbiased answers. So it really is a breath of fresh air when somebody offers data-driven guidance on whether an assumption is actually reasonable. That was the case with a recent PDS article:

Marston L, Carpenter JR, Walters KR, Morris RW, Nazareth I, Petersen I. Issues in multiple imputation of missing data for large general practice clinical databases. Pharmacoepidemiol Drug Saf 2010 (currently an epub)

They make a nice case that blood pressure data is likely to be missing at random in these databases. Given my view that BP data is underused, this is a pretty major advance: it allows more confidence in inferences drawn from these large clinical databases.
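For readers who haven't worked with these methods, here is a minimal sketch of what multiple imputation under a missing-at-random assumption looks like in practice. The toy data, the variable names, and the simple regression imputation model are mine for illustration; this is not the authors' code, and a real analysis would use a full chained-equations implementation.

```python
# Minimal sketch of multiple imputation under a missing-at-random (MAR)
# assumption. Hypothetical toy data; not the Marston et al. code or data.
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: age predicts systolic BP; BP is missing more often for
# younger patients, so missingness depends only on observed age (MAR).
n = 2000
age = rng.normal(60, 10, n)
sbp = 100 + 0.5 * age + rng.normal(0, 10, n)
miss = rng.random(n) < 1 / (1 + np.exp((age - 55) / 5))  # younger -> more missing
sbp_obs = np.where(miss, np.nan, sbp)

M = 20  # number of imputed datasets
estimates, variances = [], []
obs = ~np.isnan(sbp_obs)

for _ in range(M):
    # Fit the imputation model on complete cases: BP ~ age.
    X = np.column_stack([np.ones(obs.sum()), age[obs]])
    beta, res, *_ = np.linalg.lstsq(X, sbp_obs[obs], rcond=None)
    sigma = np.sqrt(res[0] / (obs.sum() - 2))
    # Impute missing BP from the predictive distribution; the added noise is
    # what separates proper MI from single imputation. (A fully proper MI
    # would also draw beta and sigma from their posterior; skipped for brevity.)
    pred = beta[0] + beta[1] * age[~obs] + rng.normal(0, sigma, (~obs).sum())
    sbp_imp = sbp_obs.copy()
    sbp_imp[~obs] = pred
    # Analysis of interest: mean systolic BP in each imputed dataset.
    estimates.append(sbp_imp.mean())
    variances.append(sbp_imp.var(ddof=1) / n)

# Rubin's rules: combine the M estimates and their variances.
q_bar = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / M) * between
print(f"Pooled mean SBP: {q_bar:.1f} (SE {np.sqrt(total_var):.2f})")
```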

Good show, folks!

Tuesday, March 23, 2010

More questions about the statistics of Freakonomics

Felix Salmon is on the case:

There’s a nice empirical post-script to the debate over the economic effects of classifying the Spotted Owl as an endangered species. Freakonomics cites a study putting the effect at $46 billion, but others, including John Berry, who wrote a story on the subject for the Washington Post, think it’s much closer to zero.

And now it seems the Berry side of the argument has some good Freakonomics-style panel OLS regression analysis of the microeconomy of the Pacific Northwest to back up its side of the argument. A new paper by Annabel Kirschner finds that unemployment in the region didn’t go down when the timber industry improved, and it didn’t go up when the timber industry declined — not after you adjust for much more obvious things like the presence of minorities in the area.

Comparing Apples and furniture in a box



In a previous post on branding, I used Apple as an example of a company that, because of its brand, can charge a substantial premium for its high-quality products. In this New Yorker post, James Surowiecki compares Apple to companies that take the opposite approach.

For Apple, which has enjoyed enormous success in recent years, “build it and they will pay” is business as usual. But it’s not a universal business truth. On the contrary, companies like Ikea, H. & M., and the makers of the Flip video camera are flourishing not by selling products or services that are “far better” than anyone else’s but by selling things that aren’t bad and cost a lot less. These products are much better than the cheap stuff you used to buy at Woolworth, and they tend to be appealingly styled, but, unlike Apple, the companies aren’t trying to build the best mousetrap out there. Instead, they’re engaged in what Wired recently christened the “good-enough revolution.” For them, the key to success isn’t excellence. It’s well-priced adequacy.

These two strategies may look completely different, but they have one crucial thing in common: they don’t target the amorphous blob of consumers who make up the middle of the market. Paradoxically, ignoring these people has turned out to be a great way of getting lots of customers, because, in many businesses, high- and low-end producers are taking more and more of the market. In fashion, both H. & M. and Hermès have prospered during the recession. In the auto industry, luxury-car sales, though initially hurt by the downturn, are reemerging as one of the most profitable segments of the market, even as small cars like the Ford Focus are luring consumers into showrooms. And, in the computer business, the Taiwanese company Acer has become a dominant player by making cheap, reasonably good laptops—the reverse of Apple’s premium-price approach.

Monday, March 22, 2010

True models?

The p-value debate started by an article by Tom Siegfried has generated a lot of discussion. Andrew Gelman has tried to round up many of the discussion points.

But the best part of the post (besides showing the diversity out there) was hidden at the bottom. Andrew comments:

"In all the settings I've ever worked on, the probability that the model is true is . . . zero!"

Well, he is most certainly correct in pharmacoepidemiology as well. I see a lot of variation in how to handle the biases that are inherent in observational pharmacoepidemiology -- but the focus on randomized drug trials should be a major clue that these associations are tricky to model. As a point of fact, the issue of confounding by indication, channeling bias, indication bias, or whatever else you want to call it is central to the field. And the underlying idea here is that we can't get enough information about participants to model the influence of drugs being channeled to sicker patients.

So I wish that, in my field as well, people would realize that the relationships are tricky and no model is ever going to be absolutely correctly specified.

The curse of large numbers and the real problem with p-values

(Some final thoughts on statistical significance)

The real problem with the p-value isn't just that people want it to do something that it can't do; they want it to do something that no single number can ever do: fully describe the quality and reliability of an experiment or study. This simply isn't one of those mathematical beasts that can be reduced to a scalar. If you try, then sooner or later you will inevitably run into a situation where you get the same metric for two tests of widely different quality.

Which leads me to the curse of large numbers. Those of you who are familiar with statistics (i.e. pretty much everybody who reads this blog) might want to skip the next paragraph because this goes all the way back to stat 101.

Let's take the simplest case we can. You want to show that the mean of some group is positive, so you take a random sample and calculate the probability of getting the results you saw or something more extreme (the probability of getting exactly the results you saw is pretty much zero), working under the assumption that the mean of the group was actually zero. This works because the bigger the samples you take, the more closely the means of those samples will tend to follow a nice smooth bell curve, and the more tightly those means will tend to group around the mean of the group you're sampling from.

(For any teachers out there, a good way of introducing the central limit theorem is to have students simulate coin flips with Excel then make histograms based on various sample sizes.)
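For what it's worth, here is a quick version of that classroom exercise in Python rather than Excel (the sample sizes and number of replications are arbitrary choices of mine): the spread of the sample means shrinks as the samples get bigger, which is the whole point.

```python
# Coin-flip version of the central limit theorem exercise: the sample mean
# of fair coin flips clusters more tightly around 0.5 as the sample grows.
import numpy as np

rng = np.random.default_rng(1)
for n in (10, 100, 1000, 10000):
    # 2000 samples of n fair coin flips each; record each sample's mean.
    means = rng.integers(0, 2, size=(2000, n)).mean(axis=1)
    print(f"n={n:>5}  mean of sample means={means.mean():.3f}  "
          f"spread (SD) of sample means={means.std(ddof=1):.4f}")
```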

You might think of sampling error as the average difference between the mean of the group you're interested in and the mean of the samples you take from it (that's not exactly what it means, but it's close). The bigger the sample, the smaller you expect that error to be, which makes sense. If you picked three people at random, you might get three tall people or three millionaires, but if you pick twenty people at random, the chances of getting twenty tall people or twenty millionaires are next to nothing.

The trouble is that sampling error is only one of the things a statistician has to worry about. The sampled population might not reflect the population you want to draw inferences about. Your sample might not be random. Data may not be accurately entered. There may be problems with aliasing and confounding. Independence assumptions may be violated. With respect to sample size, the biases associated with these problems are all fixed quantities. A big sample does absolutely nothing to address them.

There's an old joke about a statistician who wakes up to find his room on fire, says to himself "I need more observations" and goes back to sleep. We do spend a lot of our time pushing for more data (and, some would say, whining about not having enough), but we do that not because small sample sizes are the root of all of our problems but because they are the easiest problem to fix.

Of course "fix" as used here is an asymptotic concept and the asymptote is not zero. Even an infinite sample wouldn't result in a perfect study; you would still be left with all of the flaws and biases that are an inevitable part of all research no matter how well thought out and executed it may be.

This is a particular concern for the corporate statistician, who often encounters the combination of large samples and low-quality data. It's not unusual to see analyses done on tens or even hundreds of thousands of sales or customer records, and more often than not, when the results are presented, someone will point to the nano-scale p-value as an indication of the quality and reliability of the findings.
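Here's a hypothetical illustration of that situation (the numbers are invented): with a few hundred thousand records, even a trivially small difference between two groups produces a p-value with a pile of zeros after the decimal point, and that number says nothing about whether the data are any good or the difference matters.

```python
# With a very large sample, a half-percent difference in means is enough
# to produce a "nano-scale" p-value. The p-value says the difference is
# unlikely to be exactly zero; it says nothing about data quality or
# practical importance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500_000
group_a = rng.normal(100.0, 30.0, n)   # e.g., spend per customer, old program
group_b = rng.normal(100.5, 30.0, n)   # a half-percent difference in true means

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"observed difference: {group_b.mean() - group_a.mean():.2f}")
print(f"t = {t_stat:.1f}, p = {p_value:.1e}")   # p is tiny, far below 0.001
```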

As far as I know, no one reviewing for a serious journal would think that p<0.001 means that we're 99.9% sure that a conclusion is true, but that's what almost everyone without an analytic background thinks.

And that is a problem.

Another chapter in the New Republic Debate

Check it out here.

Sunday, March 21, 2010

Silence

I'm in the process of moving from one corner of the United States to the other. Blogging may be extremely light for the next 2 weeks.

Apologies in advance!

Interesting variable taxation idea from Thoma

From Economist's View:
Political battles make it very difficult to use discretionary fiscal policy to fight a recession, so more automatic stabilizers are needed. Along those lines, if something like this were to be implemented to stabilize the economy over the business cycle, I'd prefer to do this more generally, i.e. allow income taxes, payroll taxes, etc. to vary procyclically. That is, these taxes would be lower in bad times and higher when things improve, and implemented through an automatic moving average type of rule that produces the same revenue as some target constant tax rate (e.g. existing rates).
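To make the idea concrete, here is a toy sketch of what such an automatic rule might look like. The moving-average window, the output-gap proxy, and the adjustment factor are my own illustrative choices, not Thoma's proposal, and exact revenue neutrality would require calibrating the rule.

```python
# Toy automatic-stabilizer rule: the tax rate dips below its long-run target
# when current income falls short of its recent moving average and rises
# above it when income runs ahead of that average.
import numpy as np

def procyclical_rate(income, target_rate=0.20, window=8, sensitivity=0.5):
    """Tax-rate path that moves with the cycle around a long-run target rate."""
    income = np.asarray(income, dtype=float)
    rates = np.full(income.shape, target_rate)
    for t in range(window, len(income)):
        trend = income[t - window:t].mean()   # moving-average "normal" income
        gap = (income[t] - trend) / trend     # positive in booms, negative in slumps
        rates[t] = target_rate * (1 + sensitivity * gap)
    return rates

# Example: a stylized boom-and-bust income path.
income = 100 + 10 * np.sin(np.linspace(0, 4 * np.pi, 40))
rates = procyclical_rate(income)
print(f"constant-rate revenue: {(0.20 * income).sum():.1f}")
print(f"variable-rate revenue: {(rates * income).sum():.1f}")
```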

Saturday, March 20, 2010

New Proposed National Math Standards

These actually look pretty good.

Friday, March 19, 2010

Too late for an actual post, but...

There are another couple of entries in the TNR education debate. If you're an early riser you can read them before I do.

Thursday, March 18, 2010

Some more thoughts on p-value

One of the advantages of being a corporate statistician was that generally you not only ran the test; you also explained the statistics. I could tell the department head or VP that a p-value of 0.08 wasn't bad for a preliminary study with a small sample, or that a p-value of 0.04 wasn't that impressive with a controlled study of a thousand customers. I could factor in things like implementation costs and potential returns when looking at type-I and type-II errors. For low implementation/high returns, I might set significance at 0.1. If the situation were reversed, I might set it at 0.01.

Obviously, we can't let everyone set their own rules, but (to coin a phrase) I wonder if, in an effort to make things as simple as possible, we haven't actually made them simpler. Statistical significance is an arbitrary, context-sensitive cut-off that we assign before a test based on the relative costs of a false positive and a false negative. It is not a God-given value of 5%.

Letting everyone pick their own definition of significance is a bad idea, but so is completely ignoring context. Does it make any sense to demand the same p-value from a study of a rare, slow-growing cancer (where five years is quick and a sample size of 20 is an achievement) and from a study of a drug to reduce BP in the moderately obese (where a course of treatment lasts two weeks and the streets are filled with potential test subjects)? Should we ignore a promising preliminary study because it comes in at 0.06?
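Here is a sketch of the kind of cost-based reasoning described above, with made-up numbers: pick the cutoff that minimizes expected cost, given assumed costs of a false positive and a false negative, an assumed effect size, and the sample you actually have.

```python
# Cost-based choice of significance level (all numbers are illustrative):
# cheap-to-implement / high-return programs favor a looser cutoff, while
# expensive / low-return programs favor a stricter one.
import numpy as np
from scipy import stats

def expected_cost(alpha, effect, n, cost_fp, cost_fn, p_effect_real=0.5):
    """Expected cost of a one-sided z-test run at significance level alpha."""
    z_crit = stats.norm.ppf(1 - alpha)
    power = 1 - stats.norm.cdf(z_crit - effect * np.sqrt(n))
    # False positive: effect absent but we act; false negative: effect real
    # but we fail to act.
    return ((1 - p_effect_real) * alpha * cost_fp
            + p_effect_real * (1 - power) * cost_fn)

alphas = [0.01, 0.05, 0.10]
for cost_fp, cost_fn, label in [(10, 100, "low cost / high return"),
                                (100, 10, "high cost / low return")]:
    costs = {a: expected_cost(a, effect=0.2, n=200,
                              cost_fp=cost_fp, cost_fn=cost_fn) for a in alphas}
    best = min(costs, key=costs.get)
    print(f"{label}: best alpha of the three = {best}")
```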

For a real-life example, consider the public reaction to the recent statement that we didn't have statistically significant data that the earth had warmed over the past 15 years. This was a small sample and I'm under the impression that the results would have been significant at the 0.1 level, but these points were lost (or discarded) in most of the coverage.

We need to do a better job dealing with these grays. We might try replacing the phrase "statistically significant" with "statistically significant at 10/5/1/0.1%." Or we might look at some sort of a two-tiered system, raising significance to 0.01 for most studies while making room for "provisionally significant" papers where research is badly needed, adequate samples are not available, or the costs of a type-II error are deemed unusually high.

I'm not sure how practical or effective these steps might be but I am sure we can do better. Statisticians know how to deal with gray areas; now we need to work on how we explain them.

For more on the subject, check out Joseph's posts here and here.

The winner's curse

I have heard about the article that Mark references in a previous post; it's hard to be in the epidemiology field and not have heard about it. But, for this post, I want to focus on a single aspect of the problem.

Let's say that you have a rare side effect that requires a large database to find and, even then, the power is limited. Let's say, for illustration, that the true effect of a drug on an outcome is an odds ratio (or relative risk; it's a rare disease) of 1.50. If, by chance alone, the estimate in database A is 1.45 (95% confidence interval: 0.99 to 1.98) and the estimate in database B is 1.55 (95% CI: 1.03 to 2.08), then what would be the result of two studies on this side effect?

Well, if database A is analyzed first, then maybe nobody ever looks at database B (these databases are often expensive to use and time-consuming to analyze). If database B is used first, the second estimate will be from database A (and thus lower). In fact, there is some chance that the researchers using database A will never publish at all (as it has historically been the case that null results are hard to publish).

The result? Estimates of the association between the drug and the outcome will tend to be biased upwards -- because the initial published finding (given how hard null results are to publish) will tend to be an over-estimate of the true causal effect.
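A quick simulation makes the point. The true odds ratio of 1.5 comes from the example above; the sample sizes, the event rate, and the "publish only if significant" rule are my own illustrative assumptions.

```python
# Winner's curse simulation: conditioning publication on a significant
# result biases the published estimates upward relative to the true effect.
import numpy as np

rng = np.random.default_rng(3)
true_or = 1.5
n_per_arm = 20_000          # a "large database", rare outcome
p_unexposed = 0.002
# Convert the target odds ratio into an exposed-group risk.
odds_unexposed = p_unexposed / (1 - p_unexposed)
p_exposed = true_or * odds_unexposed / (1 + true_or * odds_unexposed)

published = []
for _ in range(5_000):
    a = rng.binomial(n_per_arm, p_exposed)      # exposed cases
    c = rng.binomial(n_per_arm, p_unexposed)    # unexposed cases
    if a == 0 or c == 0:
        continue
    b, d = n_per_arm - a, n_per_arm - c
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1/a + 1/b + 1/c + 1/d)
    # Publication filter: only findings whose 95% CI excludes 1 get written up.
    if log_or - 1.96 * se > 0:
        published.append(np.exp(log_or))

print(f"true odds ratio: {true_or}")
print(f"mean published odds ratio: {np.mean(published):.2f} "
      f"({len(published)} of 5000 simulated studies 'published')")
```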

These factors make it hard to determine if a meta-analysis of observational evidence would give an asymptotically unbiased estimate of the "truth" (likely it would be biased upwards).

In that sense, on average, published results are biased to some extent.

A lot to discuss

When you get past the inflammatory opening, this article in Science News is something you should take a look at (via Felix Salmon).
“There is increasing concern,” declared epidemiologist John Ioannidis in a highly cited 2005 paper in PLoS Medicine, “that in modern research, false findings may be the majority or even the vast majority of published research claims.”

Ioannidis claimed to prove that more than half of published findings are false, but his analysis came under fire for statistical shortcomings of its own. “It may be true, but he didn’t prove it,” says biostatistician Steven Goodman of the Johns Hopkins University School of Public Health. On the other hand, says Goodman, the basic message stands. “There are more false claims made in the medical literature than anybody appreciates,” he says. “There’s no question about that.”

Nobody contends that all of science is wrong, or that it hasn’t compiled an impressive array of truths about the natural world. Still, any single scientific study alone is quite likely to be incorrect, thanks largely to the fact that the standard statistical system for drawing conclusions is, in essence, illogical. “A lot of scientists don’t understand statistics,” says Goodman. “And they don’t understand statistics because the statistics don’t make sense.”
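The arithmetic behind that claim is simple enough to sketch. The prior, power, and significance level below are illustrative choices of mine, not Ioannidis's numbers, but they show how a large share of "significant" findings can be false even when every individual test is run correctly.

```python
# Back-of-the-envelope false-finding rate: when few tested hypotheses are
# true, a sizable fraction of "significant" results are false positives.
alpha = 0.05       # significance threshold
power = 0.80       # probability of detecting a real effect
prior_true = 0.10  # fraction of tested hypotheses that are actually true

true_positives = power * prior_true
false_positives = alpha * (1 - prior_true)
share_false = false_positives / (true_positives + false_positives)
print(f"share of significant findings that are false: {share_false:.0%}")
```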

Wednesday, March 17, 2010

Evidence

I was reading Andrew Gelman (always a source of interesting statistical thoughts) and I started thinking about p-values in epidemiology.

Is there a measure in all of medical research more controversial than the p-value? Sometimes I really don't think so. In a lot of ways, it seems to dominate research just because it has become an informal standard. But it felt odd, the one time I did it, to say in a paper that there was no association (p=.0508) when adding a few more cases might have flipped the answer.
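To see how fragile that kind of borderline result is, here is a hypothetical 2x2 comparison (the counts are invented, not from the paper in question): a handful of additional cases moves the p-value from one side of the conventional 0.05 line to the other.

```python
# Hypothetical case-control counts showing how a few extra cases can flip
# the conventional call from "no association" to "association".
import numpy as np
from scipy import stats

def two_prop_p(cases1, n1, cases2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p1, p2 = cases1 / n1, cases2 / n2
    pooled = (cases1 + cases2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * stats.norm.sf(abs(z))

# Exposed group: 33 cases out of 200; unexposed group: 45 cases out of 400.
print(f"original data:    p = {two_prop_p(33, 200, 45, 400):.3f}")
# The same study with three more exposed cases turning up.
print(f"three more cases: p = {two_prop_p(36, 203, 45, 400):.3f}")
```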

I don't think confidence intervals, used in the sense of "does this interval include the null?", really advance the issue either. But it's true that we do want a simple way to decide whether we should be concerned about a possible adverse association, and the medical literature is not well suited to a complex back-and-forth discussion about statistical models.

I'm also not convinced that any other "standard of evidence" would not be similarly misapplied. Any approach that is primarily used by trained statisticians (who are sensitive to its limitations) will look good compared with a broad standard that is also applied by non-specialists.

So I guess I don't see an easy way to replace our reliance on p-values in the medical literature, but it is worth some thought.