Every predictive model relies on at least one of two things. The first is the assumption that patterns and relationships will in the future look basically like they did in the past. The second is first principles, the idea that we have such a trustworthy and complete understanding of how things work that we can say with a high level of confidence that this system or set of conditions will produce this result.
Pretty much the first thing they tell you in any introductory class on regression is that all of the beautiful and deeply reassuring math around confidence about your work assumes that you only draw conclusions about the population your data comes from. When you wander out of the range of observed data, you leave that rigorously proven framework behind.
This is always a problem with predictive models because obviously the future is outside of the range of observed data. We can get around this to a degree by not straying too far, by relying on patterns and relationships that have proven stable over time, and by keeping an eye on things that might cause our models to go haywire. You're still breaking the rules, but you're not getting too far out on the ice.
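To make the thin-ice point concrete, here's a minimal sketch (Python, with invented data) of what happens when you ask an ordinary regression for answers outside the range it was fit on. The interval formula happily produces numbers at x = 25, but their validity rests entirely on assuming the straight line keeps going:

```python
# A minimal sketch of the extrapolation problem: fit an ordinary
# least-squares line on x in [0, 10], then ask for predictions at
# x = 5 (inside the data) and x = 25 (far outside it). The data and
# model here are invented purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)  # true line plus noise

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Prediction intervals: narrow where we have data, and only as
# trustworthy as the linearity assumption where we don't.
for x_new in (5.0, 25.0):
    pred = fit.get_prediction([[1.0, x_new]]).summary_frame(alpha=0.05)
    lo = pred["obs_ci_lower"].iloc[0]
    hi = pred["obs_ci_upper"].iloc[0]
    print(f"x = {x_new:4.1f}: 95% prediction interval ({lo:.2f}, {hi:.2f})")
```

The software will hand you an interval at x = 25 just as readily as at x = 5; nothing in the output flags that the second one is resting on an unverifiable assumption.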
Models of presidential elections have always been weakly supported, both in terms of data and in terms of our understanding of the underlying mechanisms that drive them. In terms of predictive value, political data always has a sell-by date. Just because a particular group reliably voted one way 50 years ago doesn't mean that you can count on them to do the same thing today. Likewise, polling and elections have changed so much that it makes no sense to aggregate numbers that aren't relatively recent. Take away special cases and outliers, and you can easily find yourself building models on n < 10.
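For a sense of how little n < 10 buys you, here's a hedged toy example (the numbers are made up, not real election data): fit a line on eight points and watch how far the slope moves when any single observation is dropped.

```python
# An illustration of the n < 10 problem: with eight observations,
# dropping any one point can move the fitted slope substantially.
# The data below are invented; the instability is the point.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, size=8)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=8)

def slope(x, y):
    # OLS slope via the usual closed form: cov(x, y) / var(x).
    return np.cov(x, y, bias=True)[0, 1] / np.var(x)

full = slope(x, y)
loo = [slope(np.delete(x, i), np.delete(y, i)) for i in range(8)]
print(f"slope on all 8 points: {full:.2f}")
print(f"leave-one-out slopes range from {min(loo):.2f} to {max(loo):.2f}")
```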
In the 21st century, things have gotten even worse for political scientists. Old relationships have broken down, polling faces serious issues, and pretty much all the elections are outliers on one dimension or another.
Now add in black swans. These events are, more or less by definition, outside the range of observed data. You can make math-based statements about their possible aftermath, but I wouldn't say those statements are grounded in the discipline of statistics. All you can do under the circumstances is make the most informed guess you can, then run the numbers to see what you get.
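In that spirit, here's a toy Monte Carlo sketch of "guess, then run the numbers." Every input below (the baseline, the size and spread of the shock) is a guess I invented for illustration; the simulation is exact arithmetic sitting on top of those guesses, nothing more.

```python
# "Make the most informed guess you can, then run the numbers":
# a toy Monte Carlo sketch. Every input here -- the baseline, the
# shock's mean and spread -- is a guess, not an estimate, and the
# output inherits exactly that level of authority.
import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000

baseline = 0.51                                 # guessed baseline support
shock = rng.normal(-0.02, 0.03, size=n_sims)    # guessed black-swan effect
outcome = baseline + shock

print(f"P(outcome > 0.5) under these guesses: {(outcome > 0.5).mean():.2f}")
# The arithmetic is exact; the conclusion is only as good as the guesses.
```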
I won't go so far as to say one guess is as good as another, just that you shouldn't rely too heavily on guesses period.
Mark:
You write, "Pretty much the first thing they tell you in any introductory class on regression is that all of the beautiful and deeply reassuring math around confidence about your work assumes that you only draw conclusions about the population your data comes from."
I don't think so! In Regression and Other Stories we are careful to list "validity" (you're measuring the right thing) and "representativeness" (your sample is predictive of the population) as the two most important assumptions of regression. But I think that's unusual for textbooks on regression.