To evaluate accuracy, we used the Brier score, a measure of the squared error between predicted probabilities and actual outcomes. Scores range from zero (perfect accuracy) to one (perfect inaccuracy). Although no method fared especially well, there were notable patterns in relative performance.
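For readers who want to see the arithmetic, here is a minimal sketch of the Brier score for a yes-or-no event. It assumes the common binary form (the squared difference between the forecast probability and an outcome coded 1 or 0), which matches the zero-to-one range described above. The function names and example probabilities are ours, for illustration only, and are not drawn from any of the forecasts compared here.

```python
from typing import Sequence


def brier_score(prob_event: float, event_occurred: bool) -> float:
    """Squared error between a forecast probability and the 0/1 outcome."""
    if not 0.0 <= prob_event <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    outcome = 1.0 if event_occurred else 0.0
    return (prob_event - outcome) ** 2


def mean_brier_score(probs: Sequence[float], event_occurred: bool) -> float:
    """Average one forecaster's scores over a forecasting period."""
    if not probs:
        raise ValueError("need at least one forecast")
    return sum(brier_score(p, event_occurred) for p in probs) / len(probs)


# Hypothetical forecasts for an event that did not happen.
print(brier_score(0.0, False))  # 0.0: certain and right
print(brier_score(1.0, False))  # 1.0: certain and wrong
print(brier_score(0.5, False))  # 0.25: a coin-flip forecast
```

Under this definition, a 50-50 forecast scores 0.25, which gives a rough benchmark for reading the scores that follow.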
For the overall electoral college outcome, all methods received Brier scores above 0.50. But crowd-based methods — which tended to give a higher probability to surprising outcomes — were slightly less wrong overall.
The left-hand graph below presents the estimates of Hillary Clinton’s probability of winning the electoral college over time.
Over time, the Huffington Post model grew more confident in a Clinton victory while the other methods wavered in the two weeks leading up to the election. Good Judgment’s forecast was less volatile and less extreme. The risk of such stability, of course, is that the forecast isn’t updating fast enough in response to new information.
The right-hand graph shows the Brier scores both for the month or so before the election and for the final pre-election forecast. Good Judgment and Hypermind were the most accurate methods by the last day of the campaign, with Brier scores of 0.56 and 0.61, respectively. PredictWise, Daily Kos and Huffington Post followed.
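To make those final-day numbers more concrete, a Brier score can be translated back into a rough probability. The sketch below assumes each final score comes from a single binary forecast, so the score is simply the square of the probability placed on the outcome that did not happen (here, a Clinton win). That is our assumption about how the scores were computed, and the function name is hypothetical.

```python
import math


def implied_probability(brier: float) -> float:
    """Probability placed on the outcome that did NOT occur,
    assuming a single binary forecast scored as (p - 0) ** 2."""
    if not 0.0 <= brier <= 1.0:
        raise ValueError("binary Brier score must be between 0 and 1")
    return math.sqrt(brier)


# Final pre-election Brier scores reported in the text.
final_scores = {"Good Judgment": 0.56, "Hypermind": 0.61}

for name, score in final_scores.items():
    print(f"{name}: about {implied_probability(score):.2f} on Clinton")
# Good Judgment: about 0.75 on Clinton
# Hypermind: about 0.78 on Clinton
```

Read this way, even the best-scoring methods would have given Clinton roughly a three-in-four chance on the last day.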
At the state level, there were four states where all of these forecasters made predictions — Florida, North Carolina, Ohio and Pennsylvania — although we report Brier scores for other battleground states too. The graph below shows these scores, which are randomly positioned on the vertical axis so they don’t overlap.
In the first four states, Good Judgment and Hypermind were virtually tied for the lowest Brier score, and thus the best forecasts, over the full pre-election period.
What does this analysis tell us about where prediction errors came from? Some poll aggregators have been criticized for being overly bullish on Clinton. Is this because their forecasts were biased in her favor or just overly confident in her chances?
If a pro-Democrat bias were responsible, poll-based models would have performed poorly across all the states that Clinton lost. This didn’t happen. For example, the Huffington Post model fared the best of all methods in Arizona and Iowa — two states that Trump won. However, those wins were more than offset by worse scores in Florida, North Carolina, Ohio and Pennsylvania. So overconfidence seems to be the main factor. By contrast, Daily Kos did better at the state level, mostly because it picked up on uncertainty in North Carolina, Florida and Ohio and produced less confident forecasts in those states.
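The difference between bias and overconfidence can be illustrated with a toy example. Every number below is invented for illustration and does not come from any real forecaster. The sketch assumes six states that Clinton lost: two where the polls leaned toward Trump and four where they leaned toward Clinton. A "confident" model pushes its probabilities toward the extremes, while a "cautious" model stays closer to 50-50.

```python
def brier_score(prob_clinton: float, clinton_won: bool) -> float:
    """Binary Brier score for a forecast of a Clinton win."""
    return (prob_clinton - (1.0 if clinton_won else 0.0)) ** 2


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


# Invented Clinton-win probabilities; Clinton loses every state in this example.
confident = {"trump_leaning": [0.10, 0.10], "clinton_leaning": [0.90] * 4}
cautious = {"trump_leaning": [0.30, 0.30], "clinton_leaning": [0.60] * 4}


def group_scores(model: dict) -> dict:
    """Mean Brier score within each group of states."""
    return {g: mean(brier_score(p, False) for p in ps) for g, ps in model.items()}


def overall_score(model: dict) -> float:
    """Mean Brier score across all states."""
    return mean(brier_score(p, False) for ps in model.values() for p in ps)


for name, model in [("confident", confident), ("cautious", cautious)]:
    by_group = {g: round(s, 2) for g, s in group_scores(model).items()}
    print(name, by_group, "overall:", round(overall_score(model), 2))
```

In this made-up case the confident model scores best where the polls pointed the right way and worst where they did not, so it loses overall. That is the signature of overconfidence. A model with a uniform pro-Clinton tilt would instead be expected to score poorly in both groups.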
This analysis focuses on relative differences, but most political analysts and forecasters missed signals that could have tempered their confidence. These included the volatility of polling trends over time, the larger-than-usual number of undecided voters, and the risk of a systematic polling error at the state level, especially in the key battleground states.
Where do we go from here? To be useful, post-mortem analyses of forecasts must avoid two opposing errors: overlearning and complete dismissal.
It is easy to overreact to a high-profile error. (“All forecasts were wrong! No one saw this coming!”) Commentators have cast doubt on the general use of quantitative methods to predict elections. But it’s wrong to draw sweeping conclusions from mistakes in a small number of interrelated observations — in this case, election outcomes in a few battleground states.
The opposite extreme is the idea that forecasters have nothing to learn from the election. (“It’s all luck! It’s really just one observation!”) Although one presidential election can’t supply the basis for strong statistical claims, we can still learn something about producing and evaluating predictions.
Poll aggregators have already extracted some useful lessons. For example, the assumption that polling errors are highly correlated across states now has stronger empirical backing. Spatial statistics, which provide tools to account for such geographical interrelationships, may prove useful next cycle.
Forecasters could also have done better by considering historical comparisons more closely rather than focusing solely on the ins and outs of recent polls. For example, the success of populist and authoritarian leaders abroad (Rodrigo Duterte in the Philippines, Viktor Orban in Hungary and Nigel Farage in the U.K.) may have signaled a better chance for Trump than polls suggested. While picking the right comparisons is of course easier in hindsight, the idea of looking “outside” for points of comparison is still valuable and has been a mainstay in the psychology of prediction for decades.
Lastly, while our analysis showed crowd-based approaches — like prediction markets and Good Judgment’s prediction polls — were slightly more accurate than model-based methods in this comparison, we don’t know if this pattern would persist. The distinction between models and crowds may yet blur over time as hybrid approaches, which combine the discipline and scalability of statistical models with the diversity of information accessible to crowds, gain popularity in the next few cycles.
In the meantime, we can be confident in this: predictioneers are working hard to up their game and ensure that, come next election, no crows or bugs need to fear becoming a forecaster’s next meal.
Pavel Atanasov and Regina Joseph are the co-founders of Pytho, which uses decision science to improve prediction. They also teach research methods and strategic foresight at New York University’s Center for Global Affairs.
Disclaimer: Until 2015, Pavel Atanasov was a postdoctoral scholar, and Regina Joseph was a member of the research team, at the Good Judgment Project, a precursor to Good Judgment Open.