Twitter is full of relative junk: tweets you don't want to read from people you're not all that interested in knowing, almost all of them chiming in on topics (see this, this and this) you'd never want to hear about in real life.
That's the takeaway from this impressive graph, showing the recently created University of Michigan Social Media Job Loss Index:
The blue line tracks initial seasonally adjusted claims for unemployment insurance in official Department of Labor statistics. The red line is from a model, now updated weekly, that predicts unemployment claims based solely on the ebb and flow of Twitter missives like "I just lost my job. Who's buying my drinks tonight?"
The University of Michigan and Stanford researchers behind the index describe the project in a National Bureau of Economic Research paper here that is as fascinating for its technical detail as it is for its tour through the messy linguistics of unemployment.
Before developing the predictive index, the team, led by Michigan's Dolan Antenucci, analyzed a dataset of 19.3 billion tweets posted between July 2011 and November 2013 (that amounts to 10 percent of all tweets during that time). They developed a list of terms that they expected to find in tweets related to job loss in general, or unemployment claims in particular:
Automating a language search for signs of unemployment is tricky, though. The researchers didn't want to capture all of the meta-unemployment commentary that floods Twitter every time the government releases new employment statistics. Employment-related tweets turn out to be about one-third higher than average in weeks when the BLS Employment Situation report comes out, so the researchers tried to account for that bump.
They also had to correct for the quirks in how we talk about unemployment, which don't show up in dry statistics. We misspell stuff. We have all kinds of euphemisms for that terrible moment of losing a job. "Work" itself means many things:
We originally included the term “sacked” but eliminated the signal from further analysis because its frequency in the data — several orders of magnitude greater than other employment-related terms — suggested that its use referred to other linguistic meanings. Similarly, we eliminated “let go” because it appeared much more frequently than other employment-related phrases and seemed to have other plausible meanings. In the case of the phrase “lost * work,” inspection of the matched k-grams clearly indicated nearly universal non-employment related concepts. Many phrases referred to computer problems such as “lost all my work” and “lost my #$% work,” as well as happier references such as “lost in my work” and “lost Beethoven work.”
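To make the screening described in that passage concrete, here is a rough Python sketch of how wildcard phrases might be counted and how a suspiciously frequent term might be flagged. It is not the researchers' code. The function names, the reading of "*" as one or two intervening words, and the 100x cutoff are all assumptions made for illustration.

```python
import re
from collections import Counter
from statistics import median


def compile_phrase(pattern):
    """Turn a phrase such as 'lost * work' into a regex.

    Assumption: '*' stands for one or two intervening words.
    The paper's exact k-gram matching rules are not given in the article.
    """
    parts = [
        r"\S+(?:\s+\S+)?" if token == "*" else re.escape(token)
        for token in pattern.split()
    ]
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def count_matches(tweets, patterns):
    """Count how many tweets contain each phrase at least once."""
    compiled = {p: compile_phrase(p) for p in patterns}
    counts = Counter({p: 0 for p in patterns})
    for tweet in tweets:
        for pattern, regex in compiled.items():
            if regex.search(tweet):
                counts[pattern] += 1
    return counts


def flag_outliers(counts, factor=100):
    """Return phrases whose count exceeds `factor` times the median count.

    Assumption: the 100x cutoff is a stand-in for 'several orders of
    magnitude'; the paper's actual criterion is not stated in the article.
    """
    typical = median(counts.values())
    return sorted(p for p, n in counts.items() if n > factor * typical)


# Illustrative tweets built from phrases quoted in the article.
sample_tweets = [
    "I just lost my job. Who's buying my drinks tonight?",
    "lost all my work when the laptop crashed",
    "Lost in my work today",
]
sample_counts = count_matches(sample_tweets, ["lost * job", "lost * work"])
```

In this toy sample, "lost * work" matches two tweets and neither is about unemployment. That is the kind of result that led the researchers to drop the phrase after inspecting its matches.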
Nineteen billion tweets no doubt contain all kinds of intimate evidence of layoffs and downsizing, lost work and missing paychecks. The trick is identifying it. Below, the researchers compare their results using signals from those 10 phrases (in blue) against the Department of Labor's seasonally adjusted initial claims for unemployment (in black):
The social media line diverges from the official government data stream in multiple places, but the broad trends are similar. In both lines, jobless claims steadily fall through 2011, then flatten out in the first half of 2012 before dropping again. Where the lines diverge, the social media data may actually be telling us something that official statistics fail to capture.
If you look back at the Social Media Job Loss Index in the first graph, there's a moment around Labor Day in 2013 when official statistics show a steep drop in unemployment claims that's not reflected in the Twitter data. So which measurement was wrong? It turns out the drop in official claims was partly related to a massive computer problem in California preventing the state from processing claims. Here, at least, signs from social media may have been more accurate.
Twitter does have some important biases of its own. Its users skew young and toward people with access to computers and smartphones. And so it's reasonable to assume that Twitter isn't a perfect proxy for the entire workforce (the authors also don't address how they weed out people tweeting from outside the U.S. job market). But according to Antenucci and his co-authors, the subset of people on Twitter using the platform to broadcast their job problems is much more evenly spread across demographic groups. Middle-aged and older users, in fact, are over-represented among the tweets containing these employment signals, relative to how little they tweet in general.
Now, with all that said, these researchers aren't arguing that we give up on official government statistics now that we've got social media instead. The lines above largely move in tandem, but they're not perfectly correlated. "This is to be expected," the authors write, "since they measure different things."
One measures unemployment claims, the other aggregated signs of unemployment in the casual online conversation of millions of people. The point of the exercise isn't to replicate the government data; it's to prove the value of a related indicator (and methodology) that might give us more information about the same topic, or at a finer scale.
The next challenge is to use social data like this in areas where good economic statistics don't exist, or where indicators are harder to officially measure.
Antenucci and his co-authors don't delve into it, but there's also a curious sociological implication in all of this economic analysis: An awful lot of people now feel comfortable spreading their bad job news on the Internet.