Friday, April 30, 2010

Hypertension or Blood Pressure?

So which one do you use in your statistical models? Sometimes, in diagnosis-based data sets, you don't have a choice (hypertension is a diagnosis, but blood pressure may not be captured).

It seems like a simple question but it includes a lot of complexity. The binary variable is well understood, known to mark a relevant change in patient characteristics, and can account for things like medication treatment. The continuous variable, while it carries a lot more information, needs some assumptions about specification. For example, can we really assume linearity of the association between blood pressure and a clinical outcome? If we only have treated blood pressure, is that the parameter of interest, or is it the "underlying level of blood pressure"? If the latter, we have a messy missing data problem.
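
Here's a minimal sketch of the two specifications in Python, with simulated data and an invented risk curve (none of the numbers below come from a real study), using a spline for the continuous version so we don't have to assume linearity:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: sample size, blood pressure distribution and risk curve
# are all made up for illustration.
rng = np.random.default_rng(0)
n = 5000
sbp = rng.normal(130, 15, n)  # systolic blood pressure
risk = 1 / (1 + np.exp(-(-6 + 0.03 * sbp + 0.0005 * np.maximum(sbp - 140, 0) ** 2)))
df = pd.DataFrame({
    "sbp": sbp,
    "htn": (sbp >= 140).astype(int),   # the binary "diagnosis" version
    "outcome": rng.binomial(1, risk),
})

# Option 1: the binary variable (hypertension yes/no).
m_binary = smf.logit("outcome ~ htn", data=df).fit(disp=0)

# Option 2: the continuous variable, with a B-spline basis so the
# association isn't forced to be linear on the logit scale.
m_spline = smf.logit("outcome ~ bs(sbp, df=4)", data=df).fit(disp=0)

print(m_binary.aic, m_spline.aic)  # compare model fit (lower AIC is better)
```

Neither version, of course, does anything about the treated-blood-pressure question; that really is a missing data problem.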

I admit, as a statistics guy, I strongly incline towards the continuous version of the variable. But it is not at all clear to me that it is always the dominant choice for dealing with these types of variables.

Thursday, April 29, 2010

Landscapes and Lab Rats

In this post I discussed gradient searches and the two great curses of the gradient searcher, small local optima and long, circuitous paths. I also mentioned that by making small changes to the landscape being searched (in other words, perturbing it) we could sometimes (with luck) improve our search metrics without significantly changing the size and location of our optima.

The idea that you can use a search on one landscape to find the optima of a similar landscape is the assumption behind more than just perturbing. It is also the basis of all animal testing of treatments for humans. This brings genotype into the landscape discussion, but not in the way it's normally used.

In evolutionary terms, we look at an animal's genotype as a set of coordinates for a vast genetic landscape where 'height' (the fitness function) represents that animal's fitness. Every species is found on that landscape, each clustering around its own local maximum.

Genotype figures in our research landscape, but instead of being the landscape itself, it becomes part of the fitness function. Here's an overly simplified example that might clear things up:

Consider a combination of two drugs. If we use the dosage of each drug as an axis, this gives us something that looks a lot like our first example, with drug A being north/south, drug B being east/west and the effect we're measuring being height. In other words, our fitness function has a domain of all points on our AB plane and a range corresponding to the effectiveness of that dosage. Since we expect genetics to affect how the subjects react to the drugs, genotype has to be part of that fitness function. If we ran the test on lab rats we would expect a different result than if we tested it on humans, but we would hope that the landscapes would be similar (or else there would be no point in using lab rats).
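
A toy version of that fitness function might look something like this (all the numbers are invented; genotype just shifts where the peak sits and how strong the effect is):

```python
import numpy as np

# Toy "fitness function": treatment effect as a function of the doses of two
# drugs, with a genotype parameter shifting the location and height of the peak.
def effect(dose_a, dose_b, genotype):
    peak_a, peak_b, scale = genotype
    return scale * np.exp(-((dose_a - peak_a) ** 2 + (dose_b - peak_b) ** 2) / 4)

rat   = (4.0, 6.0, 0.8)   # hypothetical rat genotype
human = (4.5, 5.5, 1.0)   # hypothetical human genotype

# Evaluate both landscapes over the same grid of dosages.
a, b = np.meshgrid(np.linspace(0, 10, 101), np.linspace(0, 10, 101))
rat_surface, human_surface = effect(a, b, rat), effect(a, b, human)

# If the landscapes are similar, the two surfaces should be highly correlated
# and the rat optimum should sit near the human one.
print(np.corrcoef(rat_surface.ravel(), human_surface.ravel())[0, 1])
```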

Scientists who use animal testing are acutely aware of the problems of going from one landscape to another. For each system studied, they have spent a great deal of time and effort looking for the test species that functions most like humans. The idea is that if you could find an animal with, say, a liver that functions almost exactly like a human liver, you could do most of your controlled studies of liver disease on that animal and only use humans for the final stages.

As sound and appealing as that idea is, there is another way of looking at this.

On a sufficiently high level, with some important caveats, all research can be looked at as a set of gradient searches over a vast multidimensional landscape. With each study, researchers pick a point on the landscape, gather data in the region, then use their findings and those of other researchers to pick their next point.

In this context, important similarities between landscapes fall into two distinct categories: those involving the positions and magnitudes of the optima, and those involving the search properties of the landscape. Every point on the landscape corresponds to four search values: a max; the number of steps it will take to reach that max; a min; and the number of steps it will take to reach that min. Since we usually want to go in one direction (let's say maximizing), we can generally reduce that to two values for each point: the optimum reached and the time to converge.

All of this leads us to an interesting and somewhat counterintuitive conclusion. When searching one landscape to find the corresponding optimum of another, we are vitally interested in a high degree of correlation between the size and location of the optima; but given that similarity between optima, similarity in search statistics is at best unimportant and at worst a serious problem.

The whole point of repeatedly perturbing and then searching a landscape is to produce a wide range of search statistics. Since we're only keeping the best one, the more variability the better. (Best here would generally be the search where the global optimum is associated with the largest region, though time to converge can also be important.)
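
Here's a rough sketch of that idea in Python: a made-up one-dimensional landscape, a crude gradient search, and a loop that tries several random perturbations and keeps the one with the best search statistics. The landscape, step sizes, and perturbation scheme are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up landscape: one big peak plus a lot of small local maxima.
def base(x):
    return np.exp(-(x - 7) ** 2) + 0.3 * np.sin(3 * x)

def hill_climb(f, x, step=0.05, max_steps=2000):
    """Crude gradient search: keep stepping uphill until no neighbor is higher."""
    for steps in range(max_steps):
        left, right = f(x - step), f(x + step)
        if max(left, right) <= f(x):
            return x, steps                     # a local optimum
        x = x - step if left > right else x + step
    return x, max_steps

def search_stats(f, starts):
    """Average height reached and steps taken over many starting points."""
    results = [hill_climb(f, s) for s in starts]
    return (np.mean([f(x) for x, _ in results]),
            np.mean([n for _, n in results]))

starts = rng.uniform(0, 10, 200)
best_height, best_tilt = -np.inf, 0.0
for _ in range(20):
    tilt = rng.normal(0, 0.02)                  # one small random perturbation
    perturbed = lambda x, t=tilt: base(x) + t * x
    mean_height, mean_steps = search_stats(perturbed, starts)
    if mean_height > best_height:               # keep the best search statistics
        best_height, best_tilt = mean_height, tilt

print(best_tilt, best_height)
```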

In animal testing, changing your population of test subjects perturbs the research landscape. So what? How does thinking of research using different test animals change the way that we might approach research? I'll suggest a few possibilities in my next post on the subject.

A good Bayesian Textbook?

Say that one wanted to teach Pharmacoepidemiology students about Bayesian statistics. Say further that it was important that the book be clear and easy to follow. Are there any alternatives to Gelman and Hill (which is clear but remarkably free of drug-related examples)?

Just wondering . . .

Wednesday, April 28, 2010

Carbon Sequestration, Lap Band Surgery and the Seductive Allure of the Grand, Deferred Solution

There's a paper out (discussed here) which claims that (according to the Guardian):
[G]overnments wanting to use CCS have overestimated its value and says it would take a reservoir the size of a small US state to hold the CO2 produced by one power station.

Previous modelling has hugely underestimated the space needed to store CO2 because it was based on the "totally erroneous" premise that the pressure feeding the carbon into the rock structures would be constant, argues Michael Economides, professor of chemical engineering at Houston, and his co-author Christene Ehlig-Economides, professor of energy engineering at Texas A&M University.
We'll see if this actually kills support for CCS, but even before the paper came out, the popularity of the idea was a clear example of Grand Deferred Solution Syndrome (GDSS).

GDSS actually requires at least two solutions. The non-GDSs need to be simple, practical, available for immediate implementation, with high likelihoods of success. The GDS (usually produced by a marketing department or think tank, though spontaneous GDS formation has been observed) does not need to be simple or practical. Its implementation date should be distant and open-ended and its likelihood of success can be anywhere from small to negligible. Sufferers of GDSS will opt for the GDS even when its chances are one or more orders of magnitude lower than any of the non-GDSs.

Notable examples of non-GDSs include carbon taxes, plug-in hybrids and diet & exercise.* Notable examples of GDSs include fuel cell cars, liposuction and about twenty percent of solutions using the phrase "market forces."

Almost everyone has suffered a few bouts of GDSS, but cases involving climate change may be reaching pandemic proportions.


* This does not apply to those suffering from certain diagnosed medical conditions and eating disorders. For those people, extreme measures may be the only reasonable option.

Tuesday, April 27, 2010

Predicting the spread

Have you ever been working on a problem and had that nagging feeling that you're missing an obvious solution? Well, I'm having one of those moments now. I'm working on a project that, though it has nothing to do with sports or betting, is analogous to the following:

You want to build a model predicting the spread for games in a new football league. Because the line-up of teams is still in flux, you decide to use only stats from individual teams as inputs (for example, an indicator variable for when the Ambushers play the Ravagers would not be allowed). In other words, you're using data from individuals to predict a metric that is only defined for pairs.

Assume there are around fifty teams and each team has played all of the others exactly one time.
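
To make the setup concrete, here's roughly what the data look like (the team names, stats, and spreads are invented), along with one naive way of turning individual-team stats into features for a pair, namely taking the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical team-level stats (one row per team) and game results.
# In the real problem there would be ~50 teams and ~1,225 games.
teams = pd.DataFrame({
    "team": ["Ambushers", "Ravagers", "Maulers"],
    "off_rating": [24.1, 21.3, 18.7],
    "def_rating": [17.2, 20.5, 22.0],
})

games = pd.DataFrame({
    "home": ["Ambushers", "Ravagers", "Maulers"],
    "away": ["Ravagers", "Maulers", "Ambushers"],
    "spread": [3.0, 2.5, -6.0],   # home score minus away score
})

# Map individual-team data onto the pairwise outcome by differencing
# each team-level stat for the two teams in the game.
stats = teams.set_index("team")
X = stats.loc[games["home"]].to_numpy() - stats.loc[games["away"]].to_numpy()
y = games["spread"].to_numpy()

# Ordinary least squares on the differenced features (intercept included).
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(y)), X]), y, rcond=None)
print(coef)
```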

This feels like stat 101 but I can't recall seeing another problem like it. Anyone out there have any suggestions?

A serious discussion of the role of barter in health care

Last week I suggested that someone should dig into candidate Lowden's suggestion more deeply. I'm glad to say someone has.

[Embedded video: The Colbert Report, "Indecision 2010 Midterm Elections - Sue Lowden" (www.colbertnation.com)]

I'm amazed that no one in the audience seemed to know what a chicken ranch was.

David Brooks' 100K statistic explained

If you follow this sort of thing, you may recall that a few weeks ago, David Brooks claimed that "Over the last 10 years, 60 percent of Americans made more than $100,000 in at least one of those years, and 40 percent had incomes that high for at least three," based on research by Stephen J. Rose. It was one of those statistics that just looks wrong, and it turns out it was, though the fault seems to lie mainly with Rose's less-than-clear prose and his algorithm for calculating adjusted household income for individuals (an individual living alone could make considerably less than six figures and still have an adjusted household income of $100K).

Andrew Sprung (who was on this from the beginning) has the details:
I should not have cast my inference that Brooks was misquoting Rose as a near-certainty without being able to verify it. Literally, there was no misquote -- or rather a minor one, converting Rose's "fully 60 percent of adults had at least one year in which their incomes were at least $100,000" to a more active verb formulation: "Over the last 10 years, 60 percent of Americans made more than $100,000." Brooks' re-cast also edits out a ghost of pronoun slippage in Rose's studiedly vague formulation: "adults" had years in which "their" incomes were over $100k. While "their" grammatically agrees with "adults," keeping both in the plural somehow highlights the elision by which household income (the term Rose uses in earlier writings citing similar statistics) becomes the income enjoyed by the individuals in the household.
(h/t to Brad DeLong)

Monday, April 26, 2010

Fitness Landscapes, Ozark Style

[Update: part two is now up.]

I grew up with a mountain in my backyard... literally. It wasn't that big (here in California we'd call it a hill) but back in the Ozarks it was a legitimate mountain and we owned about ten acres of it. Not the most usable of land but a lovely sight.

That Ozark terrain is also a great example of a fitness landscape because, depending on which side you look at, it illustrates the two serious challenges for optimization algorithms. Think about a mountainous area at least partially carved out by streams and rivers. Now remove all of the rocks, water and vegetation, then drop a blindfolded man somewhere in the middle, lost but equipped with a walking stick and a cell phone that can get a signal if he can get to a point with a clear line of sight to a cell tower.

With the use of his walking stick, the man has a reach of about six feet so he feels around in a circle, finds the highest point, takes two paces that direction then repeats the process (in other words, performs a gradient search). He quickly reaches a high point. That's the good news; the bad news is that he hasn't reached one of the five or six peaks that rise above the terrain. Instead, he has found the top of one of the countless hills and small mountains in the area.
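
Here's a little Python sketch of that search on a made-up terrain (the peaks, hills, and distances are invented, of course): feel around within reach, step toward the highest point found, and repeat until nothing nearby is higher.

```python
import numpy as np

rng = np.random.default_rng(2)

# A made-up "Ozark" terrain: one tall peak plus lots of small hills.
def height(x, y):
    peak = 300 * np.exp(-((x - 60) ** 2 + (y - 40) ** 2) / 800)
    hills = 40 * np.sin(x / 3) * np.sin(y / 3)
    return peak + hills

# The blindfolded hiker: probe in a circle with the walking stick,
# move toward the highest point found, repeat until stuck.
def walk_uphill(x, y, reach=6.0, max_steps=10_000):
    angles = np.linspace(0, 2 * np.pi, 16, endpoint=False)
    for step in range(max_steps):
        probe_x = x + reach * np.cos(angles)
        probe_y = y + reach * np.sin(angles)
        heights = height(probe_x, probe_y)
        best = np.argmax(heights)
        if heights[best] <= height(x, y):
            return x, y, step            # a local hilltop: nothing higher in reach
        x, y = probe_x[best], probe_y[best]
    return x, y, max_steps

# Drop the hiker at random spots; most runs end on a minor hill, not the big peak.
ends = [walk_uphill(*rng.uniform(0, 100, 2)) for _ in range(20)]
print(sorted(round(height(x, y)) for x, y, _ in ends))
```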

Realizing the futility of repeating this process, the man remembers that an engineer friend (who was more accustomed to thinking in terms of landscape minima) suggested that if they became separated he should go to the lowest point in the area so the friend would know where to look for him. The man follows his friend's advice only to run into the opposite problem. This time his process is likely to lead to his desired destination (if he crosses the bed of a stream or a creek he's pretty much set) but it's going to be a long trip (waterways have a tendency to meander).

And there you have the two great curses of the gradient searcher, numerous small local optima and long, circuitous paths. This particular combination -- multiple maxima and a single minimum associated with indirect search paths -- is typical of fluvial geomorphology and isn't something you'd generally expect to see in other areas, but the general problems of local optima and slow convergence show up all the time.

There are, fortunately, a few things we can do that might make the situation better (not what you'd call realistic things, but we aren't exactly going for verisimilitude here). We could tilt the landscape a little or slightly bend or stretch or twist it, maybe add some ridges to some patches to give it that stylish corduroy look. (In other words, we could perturb the landscape.)

Hopefully, these changes shouldn't have much effect on the size and position of the major optima,* but they could have a big effect on the search behavior, changing the likelihood of ending up on a particular optimum and the average time to optimize. That's the reason we perturb landscapes; we're hoping for something that will give us a better optimum in a reasonable time. Of course, we have no way of knowing if our bending and twisting will make things better (it could just as easily make them worse), but if we do get good results from our search of the new landscape, we should get similar results from the corresponding point on the old landscape.

In the next post in the series, I'll try to make the jump from mountain climbing to planning randomized trials.

* I showed this post to an engineer who strongly suggested I add two caveats here. First, we are working under the assumption that the major optima are large relative to the changes produced by the perturbation. Second, our interest in each optimum is based on its size, not whether it is global. Going back to our original example, let's say that the largest peak on our original landscape was 1,005 feet tall and the second largest was 1,000 feet even, but after perturbation their heights were reversed. If we were interested in finding the global max, this would be a big deal, but to us the difference between the two landscapes is trivial.

These assumptions will be easier to justify when we start applying these concepts in the next post in the series. For now, though, just be warned that these are big assumptions that can't be made that often.

And my second favorite quote on lying

Comes from Dashiell Hammett (who, of course, had his own Hellman connection). You'll find it in the Continental Op story, "Golden Horseshoe."

"I was reading a sign high on the wall behind the bar:

ONLY GENUINE PRE-WAR AMERICAN AND BRITISH WHISKEYS SERVED HERE

I was trying to count how many lies could be found in those nine words, and had reached four, with promise of more."

Distributions and outliers

John Cook has an old but good post on the issues that even well-behaved normal distributions can have in the extremes. I would tend to argue that these extreme outliers (women over 6'8", for example) probably are due to some process that is rare (i.e. a genetic mutation, an extreme environmental exposure), and so the real height distribution is a mixture of several underlying distributions with latent (or unobserved) variables.
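
Here's a quick simulation of that argument (the means, standard deviations, and mixing proportion are all made up, not fitted to real data): a single normal puts essentially zero mass above 6'8", while even a small "rare process" component ends up dominating the extreme tail.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000_000

# Heights in inches; all parameters below are invented for illustration.
single = rng.normal(64.5, 2.5, n)            # one normal for everyone

rare = rng.random(n) < 1e-3                  # a rare latent process (0.1%)
mixture = np.where(rare, rng.normal(74, 3, n), rng.normal(64.5, 2.5, n))

cutoff = 80                                  # 6 feet 8 inches
print((single > cutoff).mean(), (mixture > cutoff).mean())
```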

But this line of thinking is actually dangerous. After all, with enough latent variables I can model almost any distribution as a sum of normal distributions. And, if I can't observe these variables, how do I know that they exist?

So I guess this is one place where my intuitions are precisely wrong for handling the problem.

Best quote ever on lying

Matt Springer's review got me to thinking about Mary McCarthy's take on Lillian Hellman:

"Every word she writes is a lie, including and and the."

(with thanks to the good people at wikiquotes)

Fox News covers quantum physics. What could possibly go wrong?

Via Felix Salmon, Matt Springer thinks he has a winner:
The Worst Physics Article Ever

Ladies and gentlemen, I give you the worst physics news article I have ever seen:

Freaky Physics Proves Parallel Universes Exist

Every word in the title is wrong but "physics". It's not freaky, doesn't prove anything we didn't already know, and has nothing to do with parallel universes nor does it shed any light on the question of their possible existence.

Look past the details of a wonky discovery by a group of California scientists -- that a quantum state is now observable with the human eye -- and consider its implications: Time travel may be feasible. Doc Brown would be proud.

Quantum states are visible to the naked eye all the time. Neon signs, laser pointers, and all kinds of other devices show quantum behavior at the macroscopic level. What this UC Santa Barbara group has done is impressive and important - they've put a tiny but macroscopic object into a superposition of macroscopic quantum states. This is a big deal, but the difference between this and everyday single-atom quantum mechanics is just one of scale. It's not new physics. And time travel? It's a category error on the scale of a reporter watching the Ottawa Senators play hockey and writing an article claiming they were the new lawmaking body of Canada.

Read more here.

In games of perfect information, bluffing is a really bad idea

But that seems to be the Republican strategy on financial reform. Jonathan Chait has the details:

So wait. Republicans think they can limit the political damage of a filibuster if they reach a bipartisan deal. But what incentive do the democrats have to reach a deal? If they can force the Republicans to maintain a filibuster, why not keep the issue going until November? The strategy here seems to be, take a political hit by opposing popular legislation, and then hope that somehow this will strengthen the party's hand in the negotiations to follow. How will this work? It's like trying to bluff your opponent in poker when both you and he know he has the stronger hand.

What's more, Republicans are no longer even pretending to be able to hold the line after today's vote. This is amazing:

McConnell secured a commitment from his conference to hold together in opposition on the first vote, but all bets are off after that, aides acknowledge. McConnell’s challenge after Monday is preventing moderates such as Snowe and Sen. Susan Collins (R-Maine) from breaking away and weakening Republican leverage.

Now that the Democrats know the Republicans are planning to defect after the first vote, why on Earth would they compromise? Moreover, what is the point of taking the hit by filibustering reform in the first place? It could work, in theory, if you could bluff the Democrats into thinking the GOP might hold the line indefinitely. But I'm pretty sure the Democratic party has access to articles published in Politico, which means the jig is up. So now the Republicans are trying to bluff in poker when they and their opponent know they have the weaker hand, and their opponent has heard them admit that their strategy is to bet for a couple rounds and fold before the end. Why not just cut their losses now? This makes zero sense.

Sunday, April 25, 2010

The roots of Apple's business model

Click for the punchline.





Friday, April 23, 2010

"Any color you want as long as it's black"

Following up on Joseph's post, I have two points about SAS's graphics:

First, as bad as they are now, you should have seen them in the early Nineties;

Second, I think the graphics are a pretty good indication of the culture of SAS, a large, privately-held company with an effective monopoly over much of its market. SAS does good work and has an incredible record of innovation but is (in the words of some of its employees) a benevolent dictatorship. The company's attitude has always been: we will decide what you need and what's a fair price for it.

I don't mean this as a slam against SAS. After almost twenty years you can put me down as a satisfied customer. It's a good company to work with and, by all accounts, a great company to work for. I don't think going public would make SAS a better company, but I do think it would make it do some things better.