The promise of big data, the term given to analyzing trends in enormous data sets, is that it could help identify trends faster than ever before. In practice? There are a few roadblocks.
Big data analysis is great if your information is in formats that are easy for computers to read, such as spreadsheets with numbers, or responses on a scale from one to five. But a lot of information isn't organized like that. Instead, it's in presentations, memos, reports, comments or just plain e-mail. Analysis of that kind of information -- often called "unstructured" or "dark" data -- is really tough to do by computer, and companies including Intel, SAP and HP are looking for a more reliable way to do it.
Another firm, uReveal, thinks that it's cracked the code. Charles "Bucky" Clarkson, uReveal's chairman and CEO, said that software such as his makes it easier to parse documents such as government reports and organize the data so that analysts can get more out of it, and more quickly. He also claims that the software is so simple to use that (gasp!) even liberal arts majors can use it.
Jokes about "soft majors" aside, the idea of data analysis tools that anyone can use is compelling because it frees up data scientists to do more specialized work. It can also bring in specialist knowledge from people who aren't data scientists. Clarkson said, for example, that deploying these kinds of data analysis tools in hospitals has allowed doctors to spot trends that they would have otherwise missed, by analyzing their observation notes in conjunction with other electronic medical records. In one case, Clarkson said, the data even allowed physicians to discover that one type of pigskin graft was superior to another type of graft -- a conclusion that went against the conventional medical wisdom.
Consider, for example, the testimonial from Mike Wood, former executive director of the Recovery Accountability and Transparency Board -- the organization responsible for tracking the funds disbursed by the American Recovery and Reinvestment Act of 2009. The whole point of the board is to follow the money, make sure it's ending up in the right places, and see how many jobs specific grants create in certain districts.
Sounds easy enough, right? It seems like you should be able to tell exactly how, for example, funds in Duval County, Fla. were used to create jobs. But that became a tough problem for Wood and his team. The money flows to other funds, in many cases, where it's paid out to even smaller groups. That means devoting a lot of time and effort to reading through all those reports to find out how it eventually gets spent.
And reports aren't easy to parse. Even if you analyze a report to see how many times a certain word is used -- say, "bug" -- you can still miss a lot, because a word count alone can't read context.
"You need something that can disambiguate, and tell the difference between a VW bug, a virus and a computer bug," said Wood, who now advises uReveal.
Using a number of tools including uReveal, however, the Recovery Board was able to comb through those reports and find out exactly how the money was used. As it turns out, for example, the Florida grant did create jobs -- in Lansing, Mich., after the Florida school district ordered new buses.
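To make Wood's "bug" example concrete, here is a minimal Python sketch of the gap between counting a word and reading its context. It is not how uReveal or any other product works; the cue-word lists, the function names and the sample sentences are all invented for illustration, and real disambiguation software relies on far richer signals than a handful of keywords.

```python
import re
from collections import Counter

# Hypothetical cue words for each sense of "bug" (illustrative only).
SENSE_CUES = {
    "car": {"vw", "volkswagen", "beetle", "engine", "drove"},
    "illness": {"virus", "flu", "sick", "stomach", "caught"},
    "software": {"software", "code", "crash", "patch", "computer"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_word(text, word):
    """Plain frequency count: says how often a word appears, not what it means."""
    return Counter(tokenize(text))[word]


def guess_sense(sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    tokens = set(tokenize(sentence))
    scores = {sense: len(tokens & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


# Made-up sentences standing in for lines from reports.
reports = [
    "The contractor drove a VW bug to the site.",
    "Half the staff caught a stomach bug in March.",
    "A bug in the payroll software delayed the patch.",
]

total_mentions = count_word(" ".join(reports), "bug")
senses = [guess_sense(sentence) for sentence in reports]
```

The word count treats all three mentions as the same thing, while the cue-word lookup sorts them into a car, an illness and a software fault. A sentence with none of the listed cues comes back as "unknown," which is roughly where a simple keyword approach runs out of road.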
And that's just one example of how new software can illuminate this data. This sort of analysis could also help when it comes to getting data sets to work together, experts say. Another uReveal adviser, former vice chairman of the Joint Chiefs of Staff Gen. James Cartwright, said that using unstructured data analysis would help when the government is working across departments or with outside contractors who have their own ways of organizing data.
It could also, he said, help generate reports without exposing security or privacy risks. The software could be told to automatically strip out certain types of information, such as sensitive medical information, from reports. Or it could establish a way to show that the "Jim Jones" of one report is really the same guy as "James Jones" in another -- a surprisingly prevalent problem for those who wade through a lot of documents.
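Both ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of uReveal's software: the nickname table, the function names and the choice of a Social Security number pattern as the "sensitive" field are all hypothetical stand-ins.

```python
import re

# Hypothetical nickname table; a real system would need a much larger one.
NICKNAMES = {"jim": "james", "jimmy": "james", "bill": "william", "bob": "robert"}

# Illustrative pattern for one kind of sensitive field (US SSN format).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def canonical_name(name):
    """Lowercase a name and replace a known nickname with its formal form."""
    parts = name.lower().split()
    if not parts:
        return ""
    parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


def same_person(a, b):
    """Treat two names as a possible match if their canonical forms agree."""
    return canonical_name(a) == canonical_name(b)


def redact(text):
    """Replace anything that looks like an SSN with a placeholder."""
    return SSN_PATTERN.sub("[REDACTED]", text)


match = same_person("Jim Jones", "James Jones")
cleaned = redact("SSN 123-45-6789 on file")
```

A matching name is only a hint that two records describe the same person. Two different people can share a name, so a match like this would normally be checked against other fields before records are merged.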
It could also match data sets against clearance lists to make sure information is being read by the right people, he said. Contextual analysis can "keep messages from getting 'Snowdenized,'" Cartwright said, referring to the fact that former National Security Agency contractor Edward Snowden had broad access to information -- information he needed as a system administrator that also happened to let him leak a lot of information on the U.S. government. "You don't have a whole bunch of stuff coming out to a distribution that doesn't need it," he said.
Wood said that the ability to collect and parse data from government reports has become even more important given the debate around the Digital Accountability and Transparency Act, or DATA Act, which establishes government-wide financial data standards for all federal funds spent by agencies and other entities. It also aims to centralize where government financial data are published online.
Creating data standards should help with some of the problems big data enthusiasts have hit over the years, Wood said, but not all of them. By adding software that can mine deeper, and more intelligently, the government could get even more information out of its vast cache of memos.
"I think big data is one thing," Wood said. "But you need to be able to put it in context."