According to University of California, Santa Cruz researcher David Haussler, the limited access that many geneticists and computer scientists like himself have to valuable genetic data is “a crime.”
“We are on the brink of a real new understanding of cancer by being able to sequence cancer genomes,” he told me during a recent interview, but big data will be the key to unlocking it.
There are 1.6 million cases of cancer in the United States every year, Haussler explained, and most of the information from those tumors is being ignored. This is partially because of privacy restrictions on who can access personal medical data and for what purposes, and partially because there isn’t yet a concerted effort to collect the necessary genetic samples. As genome sequencing gets faster and cheaper, he said, researchers need access to healthy and cancerous samples from the same person — and as many of these samples as possible — in order to analyze the “astounding” number of molecular changes that occur in every type and variation of cancer.
“We can’t completely understand what we’ll find, but we know the only way we’ll pull out signal from the noise is to [analyze all these genes],” Haussler said.
Haussler understands the need for privacy regulations, but thinks there’s an opportunity to at least ease some current restrictions on how researchers access data. Even when relatively large (if not ideal) datasets are available, such as those from the Cancer Genome Atlas project, researchers must apply to the National Institutes of Health for access, and the data must always remain behind an organizational firewall. Every cancer patient in the country could agree to make their data available to researchers, he said, but as long as that data isn’t accessible over the internet it’s of only limited utility.
He — along with others in the field — thinks cloud computing could be the solution because it gives genetic researchers a central location where they can access and perform computations on the data. Haussler and his team, which host the Cancer Genome Atlas and a couple of other projects, currently have more than 400 terabytes of data and expect to have around 5 petabytes eventually. Downloading that much is infeasible without access to high-speed research networks, so “we need a place where people can experiment with these big data problems,” Haussler said.
In the meantime, Haussler and his peers will keep on collecting and accessing genome data however they can. And they’ll keep building software packages and algorithms that analyze that data better and faster than ever before. However, he lamented, “If we had the big data out there in an unrestricted setting, then all the best minds in the world would already be crunching on it.”
(c) 2012, GigaOM.com.