Friday, November 2, 2018

You should be concerned about the quality of the polls, but it's likely voter models that should worry you the most.

I've been meaning to do a good, substantial, well-reasoned piece on fundamental misunderstandings about political polling. This is not that post. Things have been, let us say, busy of late and I don't have time to get this right, but I do need to get it written. I really want to get this one out in the first five days of November.

So here's the short version.

When the vast majority of journalists (even most data journalists) talk about polls being wrong, they tend to screw up the discussion on at least two levels: first, because they do not grasp the distinction between data and model, and second, because they don't understand how either is likely to go kerplooie (okay, how would you spell it?).

The term "polls of registered voters" describes more or less raw data. A complete and detailed discussion would at this point mention weighting, stratification, and other topics but, as previously mentioned, this is not one of those discussions. For now, we will treat those numbers you see in the paper as summary statistics of the data.

Of course, lots of things can go wrong in the collecting. Sadly, most journalists are only aware of the least worrisome issue, sampling error. Far more troubling are inaccurate/dishonest responses and, even more importantly, nonrepresentative samples (a topic we have looked into in some depth earlier). For most reporters, "inside the margin of error" translates to "revealed word of God," and when this misunderstanding leads to disaster, they conclude that "the polls were wrong."
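For anyone who would rather see the sampling error versus nonrepresentative sample distinction in numbers, here is a minimal simulated sketch. Everything in it is an assumption made up for illustration: the 50% true support, the response rates, and the helper names `margin_of_error` and `run_poll` come from me, not from any real poll. The idea is simply that the margin of error only covers random sampling noise, so a sample skewed by who picks up the phone can miss by far more than the reported margin.

```python
import math
import random

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion (simple random sample)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def run_poll(n, true_support, rate_supporters, rate_others, rng):
    """Contact random people until n respond.

    Supporters and non-supporters answer at (possibly) different rates,
    which is what makes the responding sample nonrepresentative.
    """
    responses = []
    while len(responses) < n:
        is_supporter = rng.random() < true_support
        rate = rate_supporters if is_supporter else rate_others
        if rng.random() < rate:
            responses.append(is_supporter)
    return sum(responses) / n

rng = random.Random(2018)   # fixed seed so the sketch is repeatable
TRUE_SUPPORT = 0.50         # hypothetical true level of support
N = 1000                    # hypothetical number of respondents

# Everyone equally likely to respond: only sampling error is in play.
fair = run_poll(N, TRUE_SUPPORT, 0.10, 0.10, rng)
# Supporters somewhat more likely to respond: a nonrepresentative sample.
skewed = run_poll(N, TRUE_SUPPORT, 0.12, 0.08, rng)

moe = margin_of_error(0.5, N)  # worst-case margin, about 3 points
print(f"margin of error: +/-{moe:.3f}")
print(f"fair poll:   {fair:.3f} (miss {abs(fair - TRUE_SUPPORT):.3f})")
print(f"skewed poll: {skewed:.3f} (miss {abs(skewed - TRUE_SUPPORT):.3f})")
```

Both polls would be reported with the same margin of error, because that number depends only on the sample size and the observed proportion. Nothing in it knows how the respondents were selected.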

The term "likely voter" brings in an entirely different concept, one which is generally even less well understood by the people covering it because now we are talking not just about data, but about models. [Quick caveat: all of my experience with survey data and response models has been on the corporate side. I'm working under the assumption that the same basic approaches are being used here, but you should always consult your physician or political scientist before embarking on prognostications of your own.]

First off, it's worth noting that the very designation of "likely" is arbitrary. A model has been produced that attempts to predict the likelihood that a given individual will vote in an upcoming election, but the cutoff between likely and unlikely is simply a number that the people in the field decided was reasonable. There's nothing scientific, let alone magical, about it.
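A toy sketch of how much that cutoff can matter. The twelve respondents, their modeled turnout probabilities, and the parties "A" and "B" are all invented for illustration, and `share_for_a` is a hypothetical helper, not anything a real pollster uses. The only point is that the same raw data can produce different toplines depending on where the line between likely and unlikely is drawn.

```python
# Hypothetical respondents: (modeled probability of voting, preferred party)
respondents = [
    (0.95, "A"), (0.90, "B"), (0.85, "B"), (0.80, "A"), (0.75, "B"),
    (0.65, "A"), (0.60, "A"), (0.55, "B"), (0.45, "A"), (0.40, "A"),
    (0.30, "A"), (0.20, "B"),
]

def share_for_a(respondents, cutoff):
    """Share preferring party A among respondents at or above the cutoff."""
    likely = [party for prob, party in respondents if prob >= cutoff]
    return sum(party == "A" for party in likely) / len(likely)

# Same data, three different definitions of "likely voter".
for cutoff in (0.0, 0.5, 0.7):
    print(f"cutoff {cutoff:.1f}: A at {share_for_a(respondents, cutoff):.0%}")
```

In this made-up data, party A leads among all respondents, ties at a cutoff of 0.5, and trails at 0.7. Real screens are more elaborate, but the choice of threshold is still a choice.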

Far more important, particularly in the upcoming election, is the idea of range of data. Certain concepts somehow manage to be both painfully obvious and frequently forgotten. Perhaps the best example in statistics is that a model only describes the relationships found in the sample. When we try to extrapolate beyond the range of data, we can only hope that the relationships will continue to hold.

By its very nature, this is always a problem with predictive modeling, but it becomes a reason for skepticism bordering on panic when the variables you included in, or perhaps more to the point, left out of your model start taking on values far in excess of anything you saw in the sample. 2018 appears to be a perfect example.
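For the range-of-data problem in concrete terms, here is a toy sketch. The "enthusiasm" variable, the curved `true_turnout` function, and the ranges are all hypothetical, chosen only so the arithmetic is easy to follow. A straight line is fit to data where enthusiasm runs from 0 to 1 and is then asked about a value of 2, well outside anything in the sample.

```python
def true_turnout(x):
    """Hypothetical 'real' relationship: curved, but nearly linear on [0, 1]."""
    return 0.2 + 0.5 * x - 0.2 * x ** 2

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# The sample only covers enthusiasm values from 0.0 to 1.0.
xs = [i / 10 for i in range(11)]
ys = [true_turnout(x) for x in xs]
slope, intercept = fit_line(xs, ys)

def predict(x):
    return intercept + slope * x

# Inside the range of data the straight line looks like a fine model.
worst_in_range = max(abs(predict(x) - true_turnout(x)) for x in xs)
# Outside the range, the same line is badly wrong.
miss_out_of_range = abs(predict(2.0) - true_turnout(2.0))

print(f"worst miss inside the sample range: {worst_in_range:.3f}")
print(f"miss at x = 2.0 (outside the range): {miss_out_of_range:.3f}")
```

Nothing about the fit inside the sample warns you about the miss outside it. The model did its job on the data it saw; the trouble is entirely in asking it about conditions it never saw.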

Will the relationships we've seen in the past hold? If not, will the shift favor the Democrats? The Republicans? Or will the relationships break down in such a way that they cancel each other out? I have no intention of speculating. What I am saying is that we are currently so far out of the range of data on so many factors that I'm not sure it makes sense to talk about likely voters at all.
