The researchers looked at page views and edits for Wikipedia entries on public companies that are part of the Dow Jones Industrial Average, such as Cisco, Intel, and Pfizer, as well as Wikipedia articles on economic topics such as capitalism and debt. Changes in the average number of page views and edits per week informed decisions on whether to buy or sell the DJIA. In other words, a major increase in page views could have prompted a sale (a short position, closed out with a later buy), while a decrease in page views would prompt a buy, followed later by a sale.
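That rule can be sketched in a few lines of Python. Everything below is hypothetical: the numbers are invented, and the three-week baseline window, the weekly rebalancing, and the `weekly_signal`/`backtest` names are simplifying assumptions, not the study's actual method (which used real weekly Wikipedia view counts and DJIA closing prices).

```python
import math

def weekly_signal(views, t, window=3):
    """Compare week t's page views to the average of the prior
    `window` weeks. Rising attention -> sell (go short) the index;
    falling attention -> buy (go long)."""
    baseline = sum(views[t - window:t]) / window
    return "sell" if views[t] > baseline else "buy"

def backtest(views, prices, window=3):
    """Cumulative log-return of the strategy: each week, open a
    position based on the signal and close it one week later."""
    total = 0.0
    for t in range(window, len(prices) - 1):
        r = math.log(prices[t + 1] / prices[t])
        total += -r if weekly_signal(views, t, window) == "sell" else r
    return total

# Toy example: a spike in page views precedes a price drop,
# so the short signal fires just before the decline.
views  = [100, 100, 100, 250, 90]
prices = [50.0, 50.0, 50.0, 50.0, 45.0]
print(backtest(views, prices))  # positive: the short paid off
```

The point of the sketch is just the direction of the rule: attention up means sell, attention down means buy, with each position closed a week later.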
The researchers compared this investment strategy with a random investing strategy. They found that returns based on views of the DJIA companies' Wikipedia pages “are significantly higher than the returns of the random strategies,” to the tune of a 141 percent return, according to a news release.
There was also a significant difference between the random strategy's returns and the returns on the strategy tied to page views of economic topics: in that case, the yield was 297 percent higher than the amount invested.
To check that there wasn’t a hidden variable in the data on views of company and topic pages, the researchers compared the earnings on Dow Jones investments tied to page views of actors and filmmakers, whose pages drew just as many views as those of the DJIA companies. Indeed, they found no statistically significant effect there. And that makes sense in theory: who checks out Matt Damon’s Wikipedia entry before making an investment? But checking a Wikipedia page on Cisco might be a more reasonable action before investing in Cisco.
Incidentally, some of the researchers behind this project have also investigated connections between the Dow Jones and the use of certain financial search terms on Google. Other researchers have previously found connections between Google search patterns on stocks and stock price changes over time.
While predictive analytics has become a hot area, with applications ranging from social media conversations to crime, from the flu to retweets, data scientists often acknowledge that the data used for analysis needs to be solid and reliable. Edit data from Wikipedia isn’t inherently reliable, since anyone can edit it, and in this study the edit-based signal turned out not to be statistically significant. Page views could perhaps be manipulated by a computer pinging Wikipedia again and again, which could throw off an algorithm pulling page-view data in real time.
And tweets can be all over the place: there’s no style guide or fact checking on Twitter. So getting a good read on sentiment from tweets on, say, StockTwits can be hit or miss. And Google’s Flu Trends feature, heralded as an early use of crowdsourced data, reportedly overestimated the flu outbreak late last year.
Clearly, there are caveats to these data sets. Still, it’s neat to see new models emerging for the use of public data, and people who want to make money off Wikipedia metadata might want to experiment with it. Just don’t blame us if the experiments backfire.
(c) 2013, GigaOM.com.