What’s the motivation?
Social scientists like to be able to work with quantitative data because it allows us to make more precise estimates (e.g., “Romney is losing Ohio by 5 percentage points” vs. “Romney seems to be doing poorly in Ohio”) and because it allows us to estimate our uncertainty about any statements we might make (remember those percentages on Nate Silver’s blog about how likely Obama was to win the election?).
At the same time, however, we know that there is a tremendous amount of information about the world that we can collect and study that comes in the form of words, not numbers. Moreover, the quantity of text that we can now collect has increased dramatically in recent years with the explosion of #BigData, including social media (e.g., tweets, blogs, status updates, etc.) but also the digitization of information that has moved online (newspapers, laws, speeches, house prices, etc.). Thus, not only do we want to be able to quantify text so we can analyze it using statistical methods, but it also turns out that there is so much text, there is no way we can read it all — let alone analyze it — without the help of machines.
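To make the idea of “quantifying text” concrete, here is a toy sketch (my own illustration, not drawn from any of the papers discussed here) of the simplest version of the idea: turning short documents into word-count vectors that statistical methods can then operate on. The example documents and the helper function are invented for illustration.

```python
from collections import Counter

# Two toy "documents" -- invented examples, not real data from any study.
docs = [
    "the economy is improving",
    "the economy is struggling",
]

# Build a shared vocabulary so every document maps to the same columns.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Count how often each vocabulary word appears in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# Each document becomes a row of numbers: a tiny document-term matrix.
vectors = [to_vector(doc) for doc in docs]
```

Once text is represented this way, the usual statistical toolkit (scaling, classification, topic models) applies; real research pipelines add many refinements, but the basic move from words to numbers looks like this.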
For this reason, scholars have turned increasingly to the study of how we can use machines to study “text as data.” The conference I attended featured research on methods for conducting those types of studies (see this paper on how to train machines to identify what a quote is and to whom it is attributed), but also highlighted the substantive findings that such studies are beginning to produce. For example:
— The Department of Defense is more likely to release positive news during the week (when it can ostensibly gain more media attention) but does not appear to be more likely to release bad news on Fridays or Saturdays (what Josh Lyman on The West Wing called “taking out the trash”). [from research by University of Alabama political scientists Joseph Walsh and Gregory P. Austin; their paper is here.]
— Antonin Scalia is usually one of the most conservative justices on the Supreme Court on most issues (as compared with justices over the past 20 years), but is actually closer to the middle on labor issues [from research by political scientists Ben Lauderdale (LSE) and Tom Clark (Emory); their paper is here.]
— When there is uncertainty about the status of U.S.-China relations, U.S. policymakers tend to improve their perceptions of China, but Chinese policymakers tend to downgrade their opinions of the United States [from research by Harvard University Ph.D. candidate in political science Erin Baggott; her paper is here.]
— When you look at all the speeches made by John McCain, Mitt Romney, and Barack Obama when they were running for president, you can document that the candidates did indeed use more “extreme” ideological language in the primary campaign (McCain and Romney to the right, Obama to the left) than they did in the general election, when all the candidates moved toward the center. And yes, anecdotally we have always described primaries this way, but the fact that empirical evidence confirms the received wisdom does not make it any less valuable to see it confirmed. [from research by political scientists Justin Gross and Brice Acree of the University of North Carolina and computer scientists Yanchuan Sim and Noah A. Smith of Carnegie Mellon University; the paper is here.]
— Twitter data may be able to predict representative state-level polls of support for presidential candidates in states, on days, or for smaller regions and time periods where polling is unavailable. (Watch out, Nate Silver!) [from research by Northeastern University political scientist Nicholas Beauchamp; the paper is here.]
In addition, it has been really interesting to see the kinds of data that people are now able to analyze. One paper draws on the more than 100,000 bills introduced in Congress since 1989; another on a dataset including the headlines and abstracts of 1.3 million New York Times articles; and the paper that my research group (the NYU Social Media and Political Participation laboratory) presented here analyzes every tweet from every member of Congress since the start of January 2013.
* The conference is an annual event jointly sponsored by the Institute for Quantitative Social Science at Harvard University, the Ford Center for Global Citizenship at Northwestern University and the Quantitative Analysis of Textual Data for Social Sciences project at the London School of Economics. For more details, see the conference Web site.