Get ready for some incomprehensibly big numbers.
Michael Schatz, co-author of the study and a professor at Cold Spring Harbor Laboratory in New York, called the data challenge one of the most important questions facing biology today.
"Scientists are really shocked at how far genomics has come," Schatz said. "Big data scientists in astronomy and particle physics thought genomics had a trivial amount of data. But we're catching up and probably going to surpass them."
To give some idea of the amount of data we're talking about, consider YouTube, which generates the most data of any source per year: around 100 petabytes, according to the study. A petabyte is a quadrillion bytes (that's 1 followed by 15 zeroes), or about 1,000 times the average storage on a personal computer.
Right now, all of the human data generated through genomics — including around 250,000 sequences — takes up about a fourth of the size of YouTube's yearly data production. If the data were combined with all the extra information that comes with sequencing genomes and recorded on typical 4-gigabyte DVDs, Schatz said the result would be a stack about half a mile high.
But the field is just getting started. Scientists are expecting as many as 1 billion people to have their genomes sequenced by 2025. The amount of data being produced in genomics daily is doubling every seven months, so within the next decade, genomics is looking at generating somewhere between 2 and 40 exabytes a year.
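To get a feel for what a seven-month doubling time means, here is a minimal Python sketch of the compounding arithmetic. The function name `growth_factor` is made up for illustration, and the sketch assumes the doubling time stays constant, which is not necessarily how the study arrived at its range of 2 to 40 exabytes.

```python
def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Multiplier on the data production rate after `months`.

    Assumption (not from the study): the doubling time stays constant.
    """
    return 2.0 ** (months / doubling_months)


# Seven months is exactly one doubling.
print(growth_factor(7))

# One year at the same pace: 2 ** (12 / 7).
print(round(growth_factor(12), 2))
```

Under that assumption, the daily output a bit more than triples every year.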
An exabyte (just try to wrap your mind around this) is 1,000 petabytes, or about 1 million times the amount that can be stored on a home computer. In other words, that aforementioned stack of DVDs would easily start reaching into space.
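The unit comparisons are easy to check with a few lines of Python. This is a rough sketch using decimal (SI) units; the 1-terabyte home computer is an assumption, chosen because it is what the article's "1,000 times" and "1 million times" ratios imply.

```python
# Decimal (SI) byte units.
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18

# Assumption: a home computer stores about 1 terabyte.
HOME_COMPUTER = 1 * TERABYTE

# How many home computers fit in a petabyte, and in an exabyte.
print(PETABYTE // HOME_COMPUTER)
print(EXABYTE // HOME_COMPUTER)

# How many petabytes make up one exabyte.
print(EXABYTE // PETABYTE)

# YouTube's yearly output per the study, and genomics at a fourth of that.
youtube_per_year = 100 * PETABYTE
genomics_now = youtube_per_year // 4
print(genomics_now // PETABYTE)
```

By the same arithmetic, the study's upper estimate of 40 exabytes a year would be 400 times YouTube's current yearly output.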
The study gives a good illustration of how the microscopic details of human genetics rival the complexity of the far-reaching science of the universe. The mountain of data used to analyze human DNA is so large that Schatz jokes people will eventually have to substitute the term "astronomical" with a more appropriate word: "genomical."
"With all of this information, something new is going to emerge," he said. "It might show patterns of how mutations affect different diseases."
IBM's Watson Genomics initiative, for example, is crunching data on the entire genomes of tumors, with the hope of generating personalized medicine for cancer patients.
At some point, scientists might be able to save space by not storing sequences in full, similar to the way data is managed in particle physics, where information is read and filtered while it is generated. But at this point, the study says, such data cropping isn't as practical because it's hard to figure out what future data physicians will need for their research — especially when looking at broader human populations.
Right now, most genome research teams store their data on on-site hard drive infrastructure. The New York Genome Center, for example, is generating between 10 and 30 terabytes of data a day and storing it in an on-site system. The center moves older data that isn't regularly used to cheaper, slower storage.
"At this point, we're continuously expanding file storage," said Toby Bloom, deputy scientific director at the center. "The biggest hurdle is keeping track of what we have and finding what we need."
Organizations like Bloom's are eyeing the possibility of moving their data to cloud storage, but she said that's currently not as cost-effective as expanding their physical storage infrastructure.
But size is not the only problem the field faces. Biological data is being collected from many places and in many different formats. Unlike Internet data, which is formatted relatively uniformly, genomic data is so diverse that it is difficult to use across datasets, the study says.
Companies like Amazon and Google are developing the infrastructure to put genomic data on public clouds, which would be especially helpful for smaller centers with limited IT staff, but could also help foster collaboration.
Google recently announced a partnership with the Broad Institute of MIT and Harvard aimed at providing its cloud services for scientists combined with a toolkit developed by the institute that can be used to analyze the data. The concept is to put a bunch of the world's genomic data on Google's servers, where scientists from all over can collaborate on a single platform.
"It's extremely likely to see (the cloud model) going forward," Schatz said. "It just makes more sense."