What are the chances you will vote in the 2016 general election? Did you vote in the 2012 presidential election?

These are the questions pollsters often use to identify “likely voters,” perhaps the most ubiquitous and least-understood phrase in election news — and for two big reasons:

  1. Pollsters employ widely differing methods for identifying likely voters (and many keep their methods under wraps).
  2. Research on the accuracy of likely voter identification is relatively rare, since checking whether respondents actually vote can be expensive.

Enter a major new study from the Pew Research Center, which tested which methods work best for picking likely voters and how those choices affected the accuracy of election surveys in 2014.

Strikingly, the study found that likely voter "screens" developed a half-century ago are still effective at filtering out non-voters and improving polls' representation of actual voters. But Pew also found that knowing voters' actual history from official records improves accuracy even more, and that complex machine-learning algorithms that estimate people's probability of voting improved accuracy as well.

The basics

Before we get into the nitty-gritty, here is a brief primer on likely voter models:

Most election polls start with a sample of the overall adult population or registered voters. But not everyone actually votes in a given election, and voters and non-voters can differ quite a bit; typically, actual voters are more Republican. Nobody "knows" who will vote, but pollsters try to identify who is likely to vote by asking questions that have been correlated with voting in the past — like those above — and filtering out respondents who are less engaged in the election.

The Pew study conducted a "validation" of likely voter methods. Basically, Pew asked respondents to a September 2014 survey whom they supported in the 2014 congressional elections, along with its traditional likely voter questions. It then checked whether those respondents actually voted by matching their names, addresses and demographic information to official state-level voting records aggregated by the company TargetSmart.

This allowed Pew to compare two ways of identifying future voters: asking respondents how likely they are to vote, and relying on official records of past behavior, a method routinely used by campaign polls but less so by media surveys.

Pew used the results among verified voters to judge which likely voter models performed best. Before the election, Pew's overall registered voter sample showed Democrats with a four percentage point lead in the generic congressional ballot, 42-38. But among verified 2014 voters, Republicans held a three-point edge, 44-41. The most effective likely voter model, then, is the one whose vote margin comes closest to a Republican lead of three percentage points.*

Which likely voter models worked best?

The chart below outlines how all the likely voter approaches performed. We'll walk through the specifics, but one of the clearest takeaways is the benefit of official historical voting records. As Pew researcher Scott Keeter explained in an interview, "If you have verified past votes you can significantly improve the accuracy of predictions — at least in a low-turnout election."

Take Pew's traditional Perry-Gallup scale method, for instance, which includes only respondents whose scores fall in the top 60 percent on a seven-point index. It showed an even 47-47 race between Democrats and Republicans. That was a more accurate estimate than the one for registered voters overall (42 Dem-38 Rep), though still three points short of the R+3 benchmark.
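For readers who want to see how such a cutoff works mechanically, here is a minimal sketch in Python. It is an illustration only: the function name, the sample scores and the choice to give a fractional weight to respondents tied at the boundary score are all assumptions made for this example, not details of Pew's procedure.

```python
from collections import Counter

def cutoff_weights(scores, share=0.60):
    """Return one weight per respondent: 1.0 if the respondent is clearly
    inside the top `share` of scores, 0.0 if outside, and a fraction for
    respondents tied at the boundary score (an assumption of this sketch)."""
    target = share * len(scores)      # how many respondents to keep
    counts = Counter(scores)
    weight_by_score = {}
    kept = 0.0
    for score in sorted(counts, reverse=True):   # highest scores first
        room = target - kept
        n = counts[score]
        if room >= n:                 # whole score group fits
            weight_by_score[score] = 1.0
            kept += n
        elif room > 0:                # group straddles the cutoff
            weight_by_score[score] = room / n
            kept = target
        else:                         # cutoff already reached
            weight_by_score[score] = 0.0
    return [weight_by_score[s] for s in scores]

# Hypothetical scale scores (1 = least likely to vote, 7 = most likely).
scores = [7, 7, 7, 6, 6, 5, 5, 3, 2, 1]
weights = cutoff_weights(scores)
print(weights)
# -> [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0]
```

In this made-up sample of 10 people, six "slots" are available. The five highest scorers fill five of them, and the two respondents tied at a score of 5 split the last one.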

But when past voting indicators from the voter file were added to the model, the margin shifted to a one point Republican advantage — R+1 — slightly closer to the R+3 benchmark. The impact was larger under a more complex logistic regression approach, where the addition of vote history as a predictor flipped a two-point Democratic advantage (D+2) to a two-point GOP edge (R+2), coming within one point of the verified voter margin.

The study also showed the promise of some more complex statistical methods to predict likelihood to vote, such as "random forest," a machine-learning method that determines combinations of attributes that best predict whether someone will vote.

Likely voter models using these methods performed at least as well as the traditional Perry-Gallup scale, producing a slightly better estimate. The voter list firm TargetSmart's proprietary measure of voter likelihood, developed by Clarity Campaign Labs, also yielded a strong likely voter model: a Republican lead of four points (R+4), compared with the benchmark of three points. Logistic regression models using likely voter scales performed worst but improved greatly when combined with records of historical voting.
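The comparison comes down to simple arithmetic: how far is each model's margin from the R+3 benchmark? The short Python sketch below tallies only the margins quoted in this article (the random forest figures are not given here, so they are left out). The variable and function names are ours, chosen for illustration.

```python
# Margins as Republican share minus Democratic share, in percentage
# points (negative = Democratic lead). All figures are those quoted above.
BENCHMARK = 44 - 41  # verified 2014 voters: R+3

reported_margins = {
    "Registered voters (no screen)": 38 - 42,   # D+4
    "Perry-Gallup scale": 47 - 47,              # even
    "Perry-Gallup + vote history": 1,           # R+1
    "Logistic regression": -2,                  # D+2
    "Logistic regression + vote history": 2,    # R+2
    "TargetSmart turnout measure": 4,           # R+4
}

def label(margin):
    """Format a signed margin the way poll watchers write it."""
    if margin > 0:
        return f"R+{margin}"
    if margin < 0:
        return f"D+{-margin}"
    return "Even"

# Absolute distance from the benchmark, in points.
errors = {name: abs(m - BENCHMARK) for name, m in reported_margins.items()}

for name, err in sorted(errors.items(), key=lambda kv: kv[1]):
    print(f"{name:36s} {label(reported_margins[name]):>5s}  off by {err}")
```

By this count, the unscreened registered voter sample misses by seven points, the Perry-Gallup scale by three, and the models that add vote history or TargetSmart's measure by one or two.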

The report offered some clues as to why likely voter models using official voting records tended to provide a more accurate picture of the future electorate. The study confirmed a longstanding challenge of asking about vote likelihood: Many people who say they will vote don't actually show up, and some who appear unlikely to vote eventually do cast ballots. The table below shows how respondents ranked on the 7-point scale of likely voting used by Gallup and Pew; 83 percent of those who scored the highest on the scale (most likely to vote) actually voted, but there was no record of voting for the other 17 percent. On the opposite end, more than 1 in 5 verified 2014 voters scored a 4 or lower on the likely voter scale.
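A validation of this kind boils down to a cross-tabulation: for each score on the scale, what share of respondents turned up in the voter file as having voted? Here is a minimal Python sketch of that tally. The records are invented for illustration and are not Pew's data, and the function name is our own.

```python
from collections import defaultdict

def turnout_by_score(records):
    """Map each scale score to the share of respondents at that score
    who have a record of voting."""
    totals = defaultdict(int)
    voters = defaultdict(int)
    for score, voted in records:
        totals[score] += 1
        voters[score] += int(voted)
    return {s: voters[s] / totals[s] for s in sorted(totals, reverse=True)}

# Hypothetical (scale score, matched to a vote record?) pairs.
records = [
    (7, True), (7, True), (7, True), (7, False),
    (5, True), (5, False),
    (3, False), (3, True), (3, False), (3, False),
]

rates = turnout_by_score(records)
for score, rate in rates.items():
    print(f"score {score}: {rate:.0%} validated as voters")
```

Even in this toy sample, some top scorers have no vote record and some low scorers do, which is the pattern the study describes.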

While all of Gallup's likely voter survey questions were correlated with voting, none predicted turnout as powerfully as actual records of past voting behavior.

What this means for poll watchers in the 2016 election

The fast-approaching 2016 primary contests offer a big test for pollsters' likely voter models: turnout will be even lower than in the 2014 midterm general election, and sorting voters from non-voters will be difficult in an already error-prone polling situation. Add to that mix Donald Trump, who is motivating non-traditional voters who may or may not actually turn out, and you get the picture.

Most relevant to caucus and primary polls, the Pew findings indicate that polls based on voter registration lists, which often include voters' previous history of participation, have a leg up in identifying likely voters compared with polls that can only ask respondents whether they will vote or have voted before (such as those based on random digit dialing of samples of adults). Better likely voter identification may help mitigate other challenges, such as the inability to contact large portions of the electorate whose names cannot be matched to a telephone number.

For both the primary and general election, the results illustrate the large impact of pollsters' decisions on identifying likely voters. Even this deep dive is based on just a single election, so it's unclear how well the same models would perform in a higher-turnout scenario; Pew's Keeter says he hopes to conduct a similar assessment this year to find out.

* Pew's September margin of Republicans plus 3 among validated voters is smaller than Republicans' final advantage in the election; a post-election survey of the same respondents found that their advantage had grown to nearly match the actual vote margin.