The big data revolution may soon be over.
Companies and governments will continue to collect data, of course, and computing power will continue to grow. But vastly larger sets of data, even if collected more quickly and effectively, won’t answer all our questions or solve our problems, as was once promised. This failure shouldn’t surprise us, and we can see why by looking at one of the most heralded venues for data analysis: baseball.
Baseball has a long history of technological innovations, each promising that collecting more data, or the right kind of data, would transform the game. But at each moment, the unquenchable thirst for more and better statistics shows how data revolutions actually generate new questions more often than solutions, and how those questions, in turn, generate the need for even more data.
In the 1860s, pioneering reporter Henry Chadwick promoted his new scoresheets and scoring system as essential technologies for collecting data. Though teams were already tracking runs and outs, such meager data, he claimed, was insufficient for a true analysis of the game. Only by tracking and systematically recording the “character” of each play — a “good” catch, a “clean” hit, an “earned” run — could there be enough information for accurate judgments to be made.
But Chadwick’s methods proved inadequate because they depended on a faulty system of individual informants tracking scores. In the early years of the 20th century, therefore, National League secretary (and later president) John Heydler created a system of ledgers for each team and player, enabling tracking of both season-wide and career-wide statistics. By centrally managing the collection of data rather than delegating it to newspapers or private interests, he would be able to create “comprehensive” records far superior to existing data operations, one admirer noted.
Heydler and his colleagues promised the new official repository would solve the problems of inconsistent and disjointed statistical records scattered around the country, allowing fans and management to trust the statistics. The quality of an individual player, or the worth of a team, could now reliably be ascertained with a single glance at the tables.
But it was not enough, at least not according to engineer Pete Palmer, who worked in the 1960s and 1970s to create the first database of baseball statistics. By translating the various official and unofficial records into punched cards and processing them on the era’s novel electronic supercomputers, Palmer boasted, he could verify that the year-end and career totals added up correctly. If there was a discrepancy between a team’s totals and those of its combined players, for example, the computer could reveal the problem.
By using the latest technology to process existing data (one of Palmer’s employers, System Development Corporation, was a pioneer of the “database”), Palmer now claimed the revolutionary ability to electronically access statistical data. He wanted to use computers to check the quality of data across dozens of seasons and thousands of players. By thinking in terms of a database rather than an inert list of facts, Palmer emphasized the importance of data recall and organization, and his own database would become the core of today’s definitive source of statistics, Baseball Reference.
But within a decade, these records were already irredeemably insufficient. In an era of free agency and exploding salaries, teams needed more precise information and better capacity to analyze it. Palmer’s database had been built on the official statistical summaries collected by the National and American Leagues, but the leagues had never systematically collected any play-by-play data. Because such information was not publicly available, even basic questions — such as whether one player or strategy succeeded more often in certain situations than another — couldn’t be answered.
In response, Bill James’s Project Scoresheet launched in 1984, combining a new form of the scoresheet with a network of scorers watching every game across the country, and then deploying the latest technology, personal computers, to collect and publish the data. Again a revolution was proclaimed, this time in the ability to use data to analyze in-game strategy and individual contributions: to figure out whether players justified their salaries, or why one group of players was more successful than another. What contributions, in other words, produced success?
James’s methods and data did produce serious changes in the game, highlighting the importance of on-base percentage and defensive contributions and challenging the inefficient use of relief pitchers. Project Scoresheet itself eventually furnished crucial technology and infrastructure through which the for-profit company STATS became the leading purveyor of daily baseball data for media outlets in the 1990s.
Yet even instantaneous play-by-play information would prove insufficient. In 2014, MLB unveiled Statcast, its own radar- and video-based data collection system, in which the position and movement of every player and ball could be recorded and analyzed. It was baseball’s first taste of truly “big data,” and proponents quickly heralded the fact that Statcast produced far more data in its first full season than had been collected throughout the entire history of the game. Following the so-called Moneyball revolution, which emphasized the value of statistical analysis for player acquisition and strategy, Statcast seemed to promise that there was finally enough data to fundamentally answer the game’s remaining questions.
But will it? History says no. Over the past century and a half, data revolutions have helped fans and managers better understand the game, but each was also deemed woefully inadequate within years of debuting.
The same is likely to happen to Statcast — and sooner than we might think. That’s not because some fatal technological flaw will emerge, but rather because of the nature of data itself. Data are, in essence, the things we rely on to make arguments and answer questions. And the questions we ask inevitably change over time.
There’s no doubt Statcast and similar data-collection efforts have changed the game from the days when runs batted in and earned run average dominated our understanding of players. Right now, the latest trends in baseball are defensive shifts, launch angles, exit velocity and “true” outcomes. But soon we’ll ask new questions for which new data will be required. It’s not that we haven’t learned anything, but rather that we’ll never learn everything.
Some of the most important questions today are about integrating qualitative and quantitative data: playing statistics may be useful for valuing current major leaguers, but they’re nearly useless for assessing the future value of amateurs who play against far inferior competition and whose ability may change dramatically over time. Teams have to learn to use data like exit velocity and launch angle to augment the judgments of scouts and make the inexact science of player evaluation a bit more precise.
The next baseball data revolution will also come in part because the data currently being collected won’t just capture the game as it is; they will also change the sport. If teams realize that speedy players are undervalued, for example, or that defensive shifts work, they will adjust, and those adjustments will in turn affect what data are collected and what they mean.
Similar stories could be told about the role of data in scores of other areas, from medical decision-making to political campaigns. In each case, developments that seemed to fundamentally transform the field through novel technologies of data collection were soon written off as old news.
When the concept of “big data” first emerged two decades ago, the promise was clear: Computing power had gotten so fast, storage so cheap and statistical tools so powerful that computers could offer up new kinds of analysis that could guide society to a better place. Now, with some historical perspective, we can begin to see how the rise of big data in baseball and beyond merely echoes the excitement surrounding previous developments.
For centuries, data-driven reformers have repeatedly claimed that some new technology will finally collect enough data to solve our pressing problems, but that promise has never been fulfilled. The era of big data isn’t over so much as its differences from previous epochs seem far less salient. Big data will always be with us, but arguably it always has been.