A few months ago, Twitter received more than 1,000 applications from academic research groups vying for access to a coveted data set: the site’s cache of tweets, logged since the service’s founding.
In February, the San Francisco-based micro-blogging company announced a so-called data grant program that will offer a handful of research teams access to its database, starting with Twitter co-founder Jack Dorsey’s first tweet in 2006 (“just setting up my twttr”).
Twitter has for years offered tech companies and researchers willing to pay access to the “firehose,” its real-time stream of public tweets: about 500 million a day, from about 200 million users. The company noticed an uptick in data requests from the academic community last year after it made historical data available to paying customers, in addition to real-time tweets. The data grant program represents Twitter’s first formal offer to open the vault to researchers, free of cost.
While analytics companies want to monitor social media in real time (gauging customer sentiment about specific brands during the Super Bowl, for instance), academic researchers across various fields want to “go back and study things” over almost a decade of historical data, said Chris Moody, Twitter’s vice president of data strategy.
Often, researchers are trying to build models to predict the success of political campaigns, the spread of public health crises and other phenomena, Moody said.
“They call it ‘back-testing’ — they needed to back-test their hypotheses,” he said.
Providing large volumes of historical data requires vast computing power. Depending on the request, a business customer might pay tens of thousands of dollars. So for now, the company is limiting the number of research projects it supports free of charge.
A team from Harvard and Boston Children’s Hospital, led by Harvard Medical School pediatrics professor John Brownstein, was one of six chosen for Twitter’s data grant pilot. His team combines food-poisoning reports from the Centers for Disease Control and Prevention with content from Internet users (reviews from restaurant rating site Yelp, tweets about meals gone wrong and occasional public Facebook posts) to paint a clearer picture of the spread of foodborne illnesses. In a few months, Brownstein hopes to debut a program that can help public health departments search Twitter for tweets about food-poisoning cases and respond to victims accordingly.
Research applications have been diverse, Moody said. A team at the University of California at San Diego is studying whether happy people are likely to post happy images on Twitter, which would allow the team to measure the relative happiness of cities’ residents. Researchers at the Netherlands’ University of Twente are assessing the effectiveness of social-media campaigns encouraging early cancer detection. A University of East London group is investigating a potential link between public tweets and sports team performance.
This month, as part of a separate effort, Twitter committed $10 million over five years to establish a social-media research lab at the Massachusetts Institute of Technology. Twitter also has granted the lab access to its complete database, according to the company. Deb Roy, an MIT professor who also works as Twitter’s chief media strategist, is leading the new Laboratory for Social Machines, which is dedicated to analyzing social-media content.
In its early stages, the laboratory is still deciding which research areas to pursue, according to MIT, but Roy said he hopes to eventually study issues related to gender equality and literacy.
Because it aggregates data points from individuals, governments, businesses and other groups, Twitter’s database is particularly valuable to researchers interested in the formation and spread of public opinion, Roy said.
But the challenge, Brownstein said, is filtering out the noise in Twitter’s hundreds of millions of tweets. Twitter culled its database for Brownstein’s team, providing it with 750 million tweets seemingly related to foodborne illnesses.
“People will [tweet], ‘That makes me sick.’ Some people are meaning that sarcastically, or they mean ‘sick’ in non-health related terms,” he said. Because they’re limited to 140 characters, individual Twitter messages typically contain less detail than other posts — such as Yelp reviews or Facebook statuses — but they do provide researchers with vast quantities of data, he added.
“Whatever happens to be on your mind, people are just quickly tweeting about it,” Brownstein said. “You have to sit down to write a Yelp review.”
Without the Twitter grant, Brownstein said, he doubts his team would have been able to afford access to these data points.
“It’s hard to get funding sources to support that purchase,” he said, especially when academic institutions are just beginning to understand the value of aggregated social-media postings. Still, he noted, “the digital epidemiology field is really coming into its own over the last couple of years. The mind-set has gradually been changing.”