A surprisingly high number of scientific papers in the field of genetics contain errors introduced by Microsoft Excel, according to an analysis recently published in the journal Genome Biology.
A team of Australian researchers analyzed nearly 3,600 genetics papers published in leading scientific journals, including Nature, Science and PLoS One. As is common practice in the field, these papers all came with supplementary files containing lists of genes used in the research.
The Australian researchers found that roughly 1 in 5 of these papers included errors in their gene lists that were due to Excel automatically converting gene names to things like calendar dates or random numbers.
You see, genes are often referred to in scientific literature by symbols — essentially shortened versions of full gene names. The gene "Septin 2" is typically shortened as SEPT2. "Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase" gets mercifully shortened to MARCH1.
But when you type these shortened gene names into Excel, the program automatically assumes they refer to dates — Sept. 2 and March 1, respectively. If you type SEPT2 into a default Excel cell, it magically becomes "2-Sep." It's stored by the program as the date 9/2/2016.
Even worse, there's no easy way to undo this automatic formatting once it has happened. Edit -> Undo simply deletes everything in the cell. You can try to convert the formatting from "General," the default, to "Text," which you might expect to change it back to the original characters you entered. But instead, changing the formatting to "Text" makes the cell contents appear as 42615, Excel's internal numeric code for the date 9/2/2016.
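To see where a number like 42615 comes from, here is a minimal Python sketch of the arithmetic. It assumes Excel's default 1900 date system, in which modern dates work out to a count of days since Dec. 30, 1899; the function name `excel_serial` is made up for illustration and is not part of Excel or any library.

```python
from datetime import date

# Assumption: Excel's default 1900 date system. For modern dates, the
# serial number equals the number of days elapsed since 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial(d: date) -> int:
    """Return the day count Excel would store for the given date (hypothetical helper)."""
    return (d - EXCEL_EPOCH).days

# The date Excel substitutes for SEPT2 when typed in 2016.
print(excel_serial(date(2016, 9, 2)))  # 42615
```

Once the cell holds only that day count, nothing in it records that the original entry was SEPT2, which is why switching the format to "Text" cannot bring the gene name back.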
Even more troubling, the researchers note that there's no way to permanently disable automatic date formatting within Excel. Researchers still have to remember to manually format columns as "Text" before they type anything into a new Excel sheet, every. single. time.
But even the genetics researchers among us are only human, and they sometimes forget to do this. Hence, you end up with 20 percent of these genetics papers containing preventable errors introduced by Excel.
The Australian researchers note that this problem was first identified in a paper published more than a decade ago. "Nevertheless, we find that these errors continue to pervade supplementary files in the scientific literature," they write.
Genetics isn't the only field where a life's work can potentially be undermined by a spreadsheet error. Harvard economists Carmen Reinhart and Kenneth Rogoff famously made an Excel goof — omitting a few rows of data from a calculation — that caused them to drastically overstate the negative GDP impact of high debt burdens. Researchers in other fields occasionally have to issue retractions after finding Excel errors as well.
The Australian researchers note that Excel isn't the only spreadsheet program with overly aggressive autoformatting issues — the same errors crop up in open-source programs like LibreOffice Calc and Apache OpenOffice Calc too.
They do note, however, that one perfectly free spreadsheet program did not have any issues storing the gene names as typed — Google Sheets.
For the time being, the only fix for the issue is for researchers and journal editors to remain vigilant when working with their data files. Even better, they could abandon Excel completely in favor of programs and languages that were built for statistical research, like R and Python.
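As a rough illustration of what that vigilance could look like in Python, the sketch below reads a gene list as plain text and flags entries that look like Excel-style dates such as "2-Sep." The sample data, the column name `gene`, the function `find_date_like` and the pattern itself are all assumptions made for this example; this is not the method the researchers used, and it only checks the day-month format mentioned above.

```python
import csv
import io
import re

# Assumption: mangled symbols look like "2-Sep" or "1-Mar" (day, hyphen,
# three-letter month). Other date formats would need their own patterns.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{MONTHS})$", re.IGNORECASE)

def find_date_like(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]

# Hypothetical sample file contents. The csv module keeps every field
# as text, so nothing is silently converted while reading.
sample_csv = "gene\nSEPT2\n2-Sep\nTP53\n1-Mar\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
symbols = [row["gene"] for row in rows]

suspects = find_date_like(symbols)
print(suspects)  # ['2-Sep', '1-Mar']
```

A check like this can only flag suspicious entries. Working out which gene each one originally named still requires going back to the source data.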