The Washington Post

Researchers replicate just 13 of 21 social science experiments published in top journals

The "reproducibility crisis" in science is erupting again. A research project attempted to replicate 21 social science experiments published between 2010 and 2015 in the prestigious journals Science and Nature. Only 13 replication attempts succeeded. The other eight were duds, with no observed effects consistent with the original findings.

The failures do not necessarily mean the original results were erroneous, as the authors of this latest replication effort note. There could have been gremlins of some type in the second try. But the authors also noted that even in the replications that succeeded, the observed effect was on average only about 75 percent as large as the first time around.

The researchers conclude that there is a systematic bias in published findings, “partly due to false positives and partly due to the overestimated effect sizes of true positives.”

The two-year replication project, published Monday in the journal Nature Human Behaviour, is likely to roil research institutions and scientific journals that in recent years have grappled with reproducibility issues. The ability to replicate a finding is fundamental to experimental science. This latest project provides a reminder that the publication of a finding in a peer-reviewed journal does not make it true.

Scientists are under attack from ideologues, special interests and conspiracy theorists who reject the evidence-based consensus in such areas as evolution, climate change, the safety of vaccines and cancer treatment. The replication crisis is different; it is largely an in-house problem with experimental design and statistical analysis.

Refreshingly, other scientists have a pretty good detector for which studies are likely to stand the test of time. In this latest effort, the researchers asked more than 200 peers to predict which studies would replicate and to what extent the effect sizes would be duplicated. The prediction market got it remarkably right. The study's authors suggest that scientific journals could tap into the “wisdom of crowds” when deciding how to treat submitted papers with novel results.

“I would have expected results to be more reproducible in these journals,” said John Ioannidis, a professor of medicine at Stanford. He was not involved in this new research but is closely associated with the issue of reproducibility as the author of an influential and highly provocative 2005 article titled “Why Most Published Research Findings Are False.”

Simine Vazire, a University of California at Davis psychologist who is also active in the reproducibility movement, said the new project’s replication rate (10 of 17 experiments published in Science and 3 of 4 published in Nature) “is not okay.” She said, “There’s no reason why the most prestigious journals shouldn’t demand pretty strong evidence,” and added that it would not have been difficult to attempt to replicate these experiments before publication.

One of the studies that didn’t replicate tested whether self-reported religiosity would change among test subjects who had first been asked to look at an image of the famous Auguste Rodin sculpture “The Thinker.” The original study found that people became less religious after exposure to that image.

“Our study in hindsight was outright silly,” said Will Gervais, an associate professor of psychology at the University of Kentucky. Gervais said that his original study oversold a “random flip in the data,” although other parts of his paper did replicate.

The new project attempted to replicate an experiment published in Science in 2011 that found that digital search engines change the way people remember information. The study, which received widespread media coverage, including in The Washington Post, found that people struggle to remember things that they believe they can find online. As one psychologist told The Post at the time, “with Google and other search engines, we can offload some of our memory demands onto machines.” But the replication attempt did not see any such “Google effect.”

Another experiment, conducted in Boston in 2008 and published in Science in 2010, divided passersby into “heavy” and “light” groups and gave them either a heavy clipboard or a light clipboard containing the résumé of a job applicant. The original experiment found that people holding the heavier clipboard were more likely to rate applicants as suitable for the job. The replication found no such effect. (The replication protocol deviated slightly, in that it was conducted in Charlottesville and not Boston, and passersby were given $5 for their time rather than candy.)

The advocates for greater reproducibility believe that publication pressures create an environment ripe for false positives. Scientists need to publish, and journal editors are eager to publish novel, interesting findings.

Brian Nosek, the leader of this latest reproducibility effort, is executive director of the Center for Open Science, a nonprofit that promotes transparency and reproducibility in research. In an interview with The Washington Post, he acknowledged that the focus on false positives comes at a time when science is already under attack from special interests. But he said, “I think the benefits far, far outweigh the risks.”

He went on: “The reason to trust science is because science doesn't trust itself. We are constantly questioning the basis of our claims and the methods we use to test those claims. That’s why science is so credible.”

Nosek and his allies have drawn heat for their efforts. A major report led by Nosek and published in 2015 in Science found that only about 40 percent of 100 psychology experiments could be replicated (the precise percentage depended on how one defined a successful replication). But that report incited sharp criticism from Harvard psychologist Dan Gilbert and three other researchers, who in a letter to the journal argued that many of the replication experiments didn’t follow the original protocols.

Gilbert and his colleagues argued that, in fact, the results of the Nosek-led project were consistent with psychology experiments being largely replicable.

A statement issued by the journal Science pointed out that all the experiments scrutinized in this latest effort were published before a decision several years ago by Science, Nature and other journals to adopt new guidelines designed to increase reproducibility, in part by greater sharing of data. “Our editorial standards have tightened,” said the statement from Science.

Science deputy editor emeritus Barbara Jasny said in an interview that the failure to replicate studies does not mean that the original experiments were faulty, because “there are differences in protocol, there are differences in study samples.” She noted that the journal Science serves an interdisciplinary audience.

“We do judge on more than just technical competence. We look for papers that may have applications in different fields. We look for papers that are important advances in their own field,” she said.

She said it’s important for graduate schools to have uniform methods for teaching students how to design experiments and analyze statistics, and she advocated more funding for replication studies.

“You can say, ‘Oh, this is terrible, it didn’t replicate.’ Or you could say: ‘This is the way science works. It evolves. People do more studies,’” Jasny said. “Not every paper is going to be perfect when it comes out.”

The journal Nature released a statement saying that it has been working with the scientific community to raise standards for reproducibility. The journal since 2013 has required authors of submitted papers to go through a checklist to ensure that they have explained their experimental design and analysis. “Journals, research laboratories and institutions and funders all have roles to play in fostering reproducibility,” the journal said.
