Here’s how we did our research
Every day from Sept. 19 to Nov. 7, we estimated voters’ support for Donald Trump and Hillary Clinton by combining survey data and social media sentiment analysis. We did this in four steps.
Step 1: Data collection. Through the Twitter API, or application programming interface, we collected tweets written in English (ignoring tweets in Spanish and other languages) that explicitly mentioned one of the two main candidates. To forecast the nationwide popular vote, we counted only tweets coming from the United States. To estimate state-by-state results, we analyzed tweets geolocated in each state, using the geolocation metadata attached to each tweet.
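A minimal sketch of this filtering step, assuming tweets have already been fetched from the API as dictionaries. The field names loosely mirror the shape of Twitter API responses, but the sample tweets, the `state` field and the candidate term list are invented for illustration:

```python
# Hypothetical filtering of collected tweets: keep English tweets that
# explicitly mention a candidate, and separately keep the geolocated
# subset usable for state-level estimates.

CANDIDATE_TERMS = ("trump", "clinton", "hillary")  # illustrative list

def mentions_candidate(tweet):
    """Keep only English tweets that explicitly name a candidate."""
    if tweet.get("lang") != "en":
        return False
    text = tweet["text"].lower()
    return any(term in text for term in CANDIDATE_TERMS)

def state_of(tweet):
    """Return a US state code from the tweet's geolocation metadata,
    or None if the tweet is not geolocated (only ~2-5% are)."""
    place = tweet.get("place")
    if place and place.get("country_code") == "US":
        return place.get("state")
    return None

tweets = [
    {"text": "Trump rally tonight", "lang": "en",
     "place": {"country_code": "US", "state": "OH"}},
    {"text": "Vote Hillary!", "lang": "en", "place": None},
    {"text": "Bonjour", "lang": "fr", "place": None},
]

national = [t for t in tweets if mentions_candidate(t)]
by_state = [t for t in national if state_of(t)]
print(len(national), len(by_state))  # 2 1
```

On this toy sample, two tweets qualify for the national count but only one carries usable geolocation, which mirrors the sparsity problem described below.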
While we were able to monitor the national Twitter discussion about the campaign effectively, we could not be as thorough for individual states, because only a small fraction of tweets (2 to 5 percent) include location data.
Step 2: Supervised sentiment analysis. For each tweet, we measured the sentiment toward the two candidates as either positive or negative, and we counted how many tweets explicitly expressed an intention to vote for one of the candidates. We ran a supervised sentiment analysis (that is, human-coded and computer-assisted) to filter out noise and to capture the nuances, irony, allusions and cultural references expressed in the texts, using an opinion-mining algorithm that learns from manual tagging.
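As a toy illustration of the supervised approach, the sketch below trains a miniature Naive Bayes classifier on a handful of manually tagged example tweets. The training examples are invented, and the real system used a far larger human-tagged corpus and a more sophisticated opinion-mining algorithm; this only shows the "learns from manual tagging" idea:

```python
# Toy supervised sentiment classifier: learn positive/negative labels
# from manually tagged tweets with a tiny Naive Bayes model.
import math
from collections import Counter, defaultdict

def train(tagged):
    """tagged: list of (text, label) pairs from human coders.
    Returns per-label word counts and label frequencies."""
    counts = defaultdict(Counter)
    labels = Counter()
    for text, label in tagged:
        labels[label] += 1
        counts[label].update(text.lower().split())
    return counts, labels

def classify(text, counts, labels):
    """Pick the label with the highest log-probability,
    using Laplace smoothing for unseen words."""
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, -math.inf
    for label in labels:
        score = math.log(labels[label] / sum(labels.values()))
        total = sum(counts[label].values())
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Invented manually tagged examples standing in for the human coding.
tagged = [
    ("great rally tonight", "positive"),
    ("love this candidate", "positive"),
    ("terrible debate performance", "negative"),
    ("such a dishonest answer", "negative"),
]
counts, labels = train(tagged)
print(classify("what a great candidate", counts, labels))  # positive
```

A bag-of-words model like this cannot catch irony or cultural references on its own, which is precisely why the human tagging stage matters: the algorithm inherits those judgments from the coders' labels.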
Step 3: Econometric calibration. Next, we combined our social media indicators with survey data. Every day from Sept. 19 until Oct. 2, we ran an econometric model that used the results of our sentiment analysis to predict the national survey results published at RealClearPolitics. We found that the percentage of negative comments toward Donald Trump and the online voting intentions expressed in support of Hillary Clinton together explained 97 percent of the overall trend.
Step 4: Social media prediction. From Oct. 3 until the election, we used the two social media indicators mentioned in Step 3 (the percentage of negative comments toward Donald Trump and the expressed intention to vote for Hillary Clinton) to predict the result daily, in real time.
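Steps 3 and 4 together can be sketched in a few lines of Python. The daily indicator values and poll margins below are invented for illustration (the article does not publish the raw series), and plain ordinary least squares stands in for whatever econometric specification was actually used:

```python
# Hypothetical daily indicators (fractions of monitored tweets) and
# poll margins (Clinton minus Trump, in points). All values invented.
neg_trump = [0.40, 0.42, 0.45, 0.43, 0.47, 0.44]
vote_clinton = [0.30, 0.34, 0.31, 0.35, 0.32, 0.36]
poll_margin = [1.0, 1.9, 2.1, 2.0, 2.7, 2.4]

# Design matrix with an intercept column.
rows = [[1.0, n, v] for n, v in zip(neg_trump, vote_clinton)]

# Ordinary least squares via the normal equations (X^T X) beta = X^T y,
# solved with Cramer's rule since the system is only 3x3.
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, poll_margin)) for i in range(3)]

def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

d = det3(xtx)
beta = []
for col in range(3):
    m = [row[:] for row in xtx]
    for i in range(3):
        m[i][col] = xty[i]
    beta.append(det3(m) / d)

# Calibration quality: R^2, the share of polling variance explained
# (the article reports 97 percent on the real Sept. 19 - Oct. 2 data).
pred = [sum(b * x for b, x in zip(beta, r)) for r in rows]
mean_y = sum(poll_margin) / len(poll_margin)
ss_res = sum((y - p) ** 2 for y, p in zip(poll_margin, pred))
ss_tot = sum((y - mean_y) ** 2 for y in poll_margin)
r2 = 1 - ss_res / ss_tot

# Step 4: after Oct. 2, the fitted coefficients turn each new day's
# indicators into a same-day estimate of the popular-vote margin.
new_day = [1.0, 0.46, 0.33]  # invented indicators for a later day
forecast = sum(b * x for b, x in zip(beta, new_day))
print(round(r2, 2), round(forecast, 1))
```

The key design point is that the social media signals are never read as vote shares directly; they are calibrated against polls first, and only the calibrated mapping is used once the team switches to pure social media prediction.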
With this model, Donald Trump appeared to overtake Hillary Clinton in early October. Over the next few weeks, Trump faced difficulties that included questions about tax evasion raised by his opponent and the now-infamous recording of Trump boasting about sexual assault. At the same time, however, WikiLeaks continued releasing hacked emails from the Democratic National Committee.
By the second half of October, in our model the Republican candidate's support was rising so sharply that Trump passed Clinton in the popular vote on the day FBI Director James B. Comey reopened the investigation into Hillary Clinton's emails. The race then remained narrow through the final week, with Clinton holding about a 1-point lead.
Our approach had its limits
What about the state-level results? Analyzing Twitter data as described above, we found that Ohio, Florida, Nevada and Colorado were essentially not competitive. As early as mid-October, the first two consistently leaned toward Trump, and the latter two toward Clinton.
By contrast, we found through the final month of campaigning that the Pennsylvania race was much closer than expected, with predictions moving back and forth, favoring one candidate and then the other.
The Trump rise in the Midwest could not have been predicted via Twitter. We did not detect it in Wisconsin or Iowa, and in Michigan we would have called the race much closer than surveys were showing, putting Clinton up by only half a point.
This social media analysis would have revealed an uncertain race. On the night before the election, it would have predicted Hillary Clinton to win the national vote by 1.2 percent (close to the actual popular-vote result: +2.1 percent), and would have favored her to become the next president with a probability of 59 percent.
This last prediction was wrong, of course. The inaccuracy probably stems from the limits of the geolocated data and from the fact that we calibrated the social media results against national surveys rather than state-specific ones. That was not a deliberate choice: no state polls were published every day, while national ones were.
Now we know how to incorporate social media postings in forecasting the vote
Social media data could have helped us monitor the opinions of “shy Trump supporters” better than polls alone could. (“Shy Trump supporters” are voters who lied to pollsters about whom they’d support, either because they distrusted institutions such as polling companies, or because, after scandals involving the Republican candidate were reported in the media, they felt uncomfortable saying they’d vote for him even if they intended to.)
In fact, social media analyses have repeatedly shown that many people apparently feel more free to express their personal views online — not just in the U.S. election, but on Brexit and in various European elections. Of course, social media may be biased, too, given that its users are not demographically representative of the United States.
Public opinion has profoundly changed. And the way to measure it must change as well. Social media information offers useful signals about public sentiment. Catching them is the challenge for those who study and predict politics.