West Coast Stat Views (on Observational Epidemiology and more): The Scalar Fallacy

Sometimes the things that give us the most trouble are the most obvious. When something is completely self-evident, it can be difficult to wrap your mind around it and think through its implications. Important points can be mistaken for tautologies (and vice versa), and when you try to work through the questions in essays or conversations, you often find yourself feeling pretentious and, for lack of a better word, silly.

Here's an example: vectors, random variables, and vectors of random variables are not scalars. This statement is obvious to anyone familiar with the basic terms. Equally obvious is the fact that when you try to represent one of these complex, multidimensional creatures as a point on a line, you will invariably lose some information.

The implications of these points, however, are often not obvious at all.

We have to assign scalars to things all the time because, among other reasons, scalars are the only things we can rank. Any time you want to decide what's the best _____ (car, job offer, candidate), you have to start by assigning _____ a scalar. You can do this by finding a proxy that's already a scalar (like the answer to a survey question) or by using a function of the vector. Simple examples include taking the sum or the sum of the squares or the average or the maximum value. (I'm going to limit this to vectors from here on but everything should generalize to random variables and vectors of random variables fairly easily.)
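To make the point concrete, here is a minimal sketch of those four scalarizing functions applied to the same vectors. The three "job offers" and their attribute scores are invented purely for illustration; the only claim is that the ranking you get depends on which function you pick.

```python
# Hypothetical attribute scores for three job offers (made-up numbers,
# higher is better on every component).
offers = {
    "A": [9, 1, 1],
    "B": [4, 4, 4],
    "C": [6, 4, 0],
}

def mean(v):
    """Arithmetic mean of the components."""
    return sum(v) / len(v)

def sum_of_squares(v):
    """Sum of the squared components."""
    return sum(x * x for x in v)

# Four simple ways of turning a vector into a scalar.
scalarizers = {
    "sum": sum,
    "sum_of_squares": sum_of_squares,
    "mean": mean,
    "max": max,
}

# Rank the offers from best to worst under each scalarization.
rankings = {
    name: sorted(offers, key=lambda k: f(offers[k]), reverse=True)
    for name, f in scalarizers.items()
}

for name, order in rankings.items():
    print(name, order)
```

With these particular numbers, the sum and the mean put B first, while the sum of squares and the maximum put A first. Nothing about the offers changed; only the function did.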

But, though we have to do it all the time, no one has ever found a perfect way of assigning scalars to vectors and no one ever will. This isn't pessimism; it's mathematics. You lose information when you go from a vector to a scalar. That loss means you have to be careful about contextual questions like range of data. Though there may be a few cases where we can derive the scalars from first principles, we generally have to arrive at the assignments through experimentation. We find methods that have produced useful metrics in previous situations. Unfortunately, when you move out of the range of data you encountered in those previous situations or when you otherwise find yourself in a new context, the information you could safely omit before becomes essential and the metric that has done such a good job up till now suddenly becomes worthless.

Here are a couple of examples:

A "rate your experience" question might do a good job comparing the impact of bad beverage service with that of a short delay in takeoff, but it will probably not do a satisfactory job comparing a forced landing with a seven-hour stay on the tarmac on a hot summer day. These events fall outside the range of data the question was developed for.

A weighted average of nutrients might provide a good way of ranking most of the foods you find in the produce aisle. In the context of comparing different fruits and vegetables found in your neighborhood grocery store, you might be able to get by assuming a linear relationship between the amount of certain nutrients and healthiness. If, however, you move to the context of the dietary supplement aisle, making that linear assumption about certain nutrients can be dangerous, even deadly. Having a bottle of iron supplement pills for lunch is an extraordinarily bad idea.
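A rough sketch of how that failure looks in practice follows. The weights, the nutrient amounts, and the item names are all invented for illustration and are not nutritional data; the point is only that a linear score that behaves sensibly in the produce aisle will happily crown the pill bottle once it enters the comparison.

```python
# All numbers below are invented for illustration; they are NOT nutritional data.
WEIGHTS = {"iron": 1.0, "vitamin_c": 0.5}  # hypothetical weights

def linear_score(item):
    """Weighted sum of nutrient amounts: assumes more is always better."""
    return sum(WEIGHTS[n] * item[n] for n in WEIGHTS)

produce = {
    "spinach": {"iron": 3.0, "vitamin_c": 30.0},
    "apple": {"iron": 0.2, "vitamin_c": 8.0},
}

# Same metric, new context: add an item far outside the original range.
everything = dict(produce)
everything["bottle_of_iron_pills"] = {"iron": 2000.0, "vitamin_c": 0.0}

best_produce = max(produce, key=lambda name: linear_score(produce[name]))
best_overall = max(everything, key=lambda name: linear_score(everything[name]))

print(best_produce)   # the sensible answer within the produce aisle
print(best_overall)   # the same metric, applied out of range
```

The metric did not change between the two calls. What changed was the range of the data, and with it the information (here, that iron stops being good for you past some point) that the linear score had been able to ignore.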

These are relatively simple examples but think about all the unspeakably complicated things like happiness that people routinely discuss as if they were scalars -- "people in group A were 42% happier than people in group B." Worse yet, many researchers insist on pushing these scales to ludicrous extremes, using the same metrics to measure the impact of everything from trivial lifestyle changes to the birth of a first child. (How this affects theories like rational addiction is a subject for another post.)

Perhaps even more important than being context-specific, the scalars we assign to vectors are generally question-specific. Take the example of health. There's no meaningful way to boil this complex, multidimensional concept down to one number, but we can come up with scalars that are useful when answering certain questions. Let's say we have formulas for deriving two metrics, L and Q. L correlates very well with longevity; Q correlates very well with quality of life. For most questions about health policy, you will get similar answers with either metric, but there are cases where the two diverge sharply. Both L and Q are good measures of health, but their usefulness depends on the question you need answered.
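As a small sketch of what "similar answers most of the time, sharp divergence occasionally" can look like, suppose each policy option is summarized by two invented components (years of life gained, change in day-to-day wellbeing), and suppose L and Q are two different weightings of those components. The options, the numbers, and the weights are all hypothetical.

```python
# Hypothetical policy options: (years of life gained, change in wellbeing).
# Every number here is made up for illustration.
options = {
    "A": (2.0, 1.0),
    "B": (3.0, 2.0),
    "C": (0.5, -2.0),   # extends life a little at a cost to wellbeing
    "D": (-0.1, 3.0),   # improves wellbeing, does not extend life
}

def L(v):
    """Hypothetical longevity-oriented metric."""
    return 1.0 * v[0] + 0.1 * v[1]

def Q(v):
    """Hypothetical quality-of-life-oriented metric."""
    return 0.1 * v[0] + 1.0 * v[1]

def prefers(metric, x, y):
    """True if the metric scores option x above option y."""
    return metric(options[x]) > metric(options[y])

# A versus B: both metrics agree.
agree_ab = prefers(L, "B", "A") and prefers(Q, "B", "A")

# C versus D: the metrics give opposite answers.
l_picks_c = prefers(L, "C", "D")
q_picks_d = prefers(Q, "D", "C")

print(agree_ab, l_picks_c, q_picks_d)
```

Neither metric is wrong here. Whether C or D is the "healthier" choice depends on which question you are asking.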

Part of the blame for the tendency to take scalars as ideal representations of vectors rests with the "magic of the market" faction of economists and their camp followers. Markets are basically in the scalarizing business and under the proper conditions they do a pretty good job. It's easy to see how researchers grew enamored with markets' ability to set prices in such a way that resources are effectively allocated. It is a remarkable process.

But as impressive as markets are, they are still not exempt from the laws of mathematics and the limitations listed above. Prices are scalars assigned to the values of things. They generally provide us with an excellent tool for prioritizing purchases and production, but when you start to think of the scalars as actually being the vectors they represent, your thinking becomes sloppy and you open yourself up to dangerous mistakes.

2 comments:

Hadley Wickham, January 23, 2011 at 10:50 AM
"Equally obvious is the fact that when you try to represent one of these complex, multidimensional creatures as a point on a line, you will invariably lose some information." - Technically, that's not so because R and R^n have the same cardinality: http://en.wikipedia.org/wiki/Cardinality_of_the_continuum#Sets_with_cardinality_of_the_continuum
Joseph, January 23, 2011 at 1:25 PM
@Hadley: Speaking for Mark, I interpreted the statement above as a good general principle rather than a precise statement. You have provided the obvious counter-example to it as a general point. But I think the key idea (at least the one I took away) was that the mapping of a non-trivial multi-dimensional vector onto a scalar is not easy. Furthermore, most of the really interesting stuff is in how you define the mapping (itself) which tends to be glossed over in a lot of cases (not all but I can definitely find examples in the recent medical literature).

Wednesday, January 5, 2011