
Maverick researchers have long argued that much of what gets published in elite scientific journals is fundamentally squishy — that the results tell a great story but can’t be reproduced when the experiments are run a second time.

Now a volunteer army of fact-checkers has published a new report that affirms that the skepticism was warranted. Over the course of four years, 270 researchers attempted to reproduce the results of 100 experiments that had been published in three prestigious psychology journals.

It was awfully hard. They ultimately concluded that they’d succeeded just 39 times.

The failure rate surprised even the leaders of the project, who had guessed that perhaps half the results wouldn’t be reproduced.

The new paper, titled "Estimating the reproducibility of psychological science," was published Thursday in the journal Science. The sweeping effort was led by the Center for Open Science, a nonprofit based in Charlottesville. The center's director, Brian Nosek, a University of Virginia psychology professor, said the review focused on the field of psychology because the leaders of the center are themselves psychologists.

Despite the rather gloomy results, the new paper pointed out that this kind of verification is precisely what scientists are supposed to do: “Any temptation to interpret these results as a defeat for psychology, or science more generally, must contend with the fact that this project demonstrates science behaving as it should.”

The phenomenon -- irreproducible results -- has been a nagging issue in the science world in recent years. That's partly due to a few spectacular instances of fraud, such as when Dutch psychologist Diederik Stapel admitted in 2011 that he’d been fabricating his data for years.

A more fundamental problem, say Nosek and other reform-minded scientists, is that researchers seeking tenure, grants or professional acclaim feel tremendous pressure to do experiments that have the kind of snazzy results that can be published in prestigious journals.

They don’t intentionally do anything wrong, but may succumb to motivated reasoning. That’s a subtle form of bias, like unconsciously putting your thumb on the scale. Researchers see what they want and hope to see, or tweak experiments to get a more significant result.

Moreover, there's the phenomenon of “publication bias.” Journals are naturally eager to publish significant results rather than null results. The problem is that, by random chance, some experiments will produce results that appear significant but are merely anomalies – spikes in the data that might mean nothing.
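The point about random chance can be illustrated with a small simulation. This is only a sketch under stated assumptions, not anything taken from the Science paper: it invents two-group experiments in which the true effect is exactly zero, tests each one with a simple z-test, and counts how many clear the conventional 5 percent significance threshold anyway. The function names, the group size of 30 and the 0.05 cutoff are all illustrative choices.

```python
import random
from statistics import NormalDist, mean


def null_experiment_p_value(rng, n_per_group=30):
    """Simulate one two-group experiment in which the true effect is zero.

    Both groups are drawn from the same normal distribution (an assumption
    made for illustration), so any difference between them is pure noise.
    """
    group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    # Standard error of the difference in means, with known unit variance.
    standard_error = (2.0 / n_per_group) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    # Two-sided p-value from the standard normal distribution.
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_experiments=2000, alpha=0.05, seed=0):
    """Fraction of no-effect experiments that still look 'significant'."""
    rng = random.Random(seed)
    hits = sum(null_experiment_p_value(rng) < alpha for _ in range(n_experiments))
    return hits / n_experiments


if __name__ == "__main__":
    rate = false_positive_rate()
    print(f"'Significant' results with no real effect: {rate:.1%}")
```

In this toy setup roughly one experiment in 20 should look significant even though nothing real is going on. If journals favored those results and passed over the null ones, the published record would overstate how often such effects exist.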

Reformers like Nosek want their colleagues to pre-register their experimental protocols and share their data so that the rest of the community can see how the sausage is made. Meanwhile, editors at Science, Nature and other top journals have crafted new standards that require more detailed explanations of how experiments are conducted.

Gilbert Chin, senior editor of the journal Science, said in a teleconference this week, “This somewhat disappointing outcome does not speak directly to the validity or the falsity of the theories. What it does say is that we should be less confident about many of the experimental results that were provided as empirical evidence in support of those theories.”

John Ioannidis, a professor of medicine at Stanford, has argued for years that most scientific results are less robust than researchers believe. He published a paper in 2005 with the instantly notorious title, "Why Most Published Research Findings Are False."

In an interview this week, Ioannidis called the new paper “a landmark for psychological science” and said it should have repercussions beyond the field of psychology. He said the paper validates his long-standing argument, “and I feel sorry for that. I wish I had been proven wrong.”

The 100 replication attempts, whether successful or unsuccessful, do not definitively prove or disprove the results of the original experiments, noted Marcia McNutt, editor-in-chief of the Science family of journals. There are many reasons that a replication might fail to yield the same kind of data.

Perhaps the replication was flawed in some key way – a strong possibility in experiments that have multiple moving parts and many human factors.

And science is conducted on the edge of the knowable, often in search of small, marginal effects.

“The only finding that will replicate 100 percent of the time is one that’s likely to be trite and boring and probably already known,” said Alan Kraut, executive director of the Association for Psychological Science. “I mean, yes, dead people can never be taught to read.”

One experiment that underwent replication had originally shown that students who drank a sugary beverage were better able to make a difficult decision about whether to live in a big apartment far from campus or a smaller one closer to campus. But that first experiment was conducted at Florida State University. The replication took place at the University of Virginia. The housing decisions around Charlottesville were much simpler -- effectively blowing up the experiment even before the first sugary beverage had been consumed.

Another experiment had shown, the first time around, that students exposed to a text that undermined their belief in free will were more likely to engage in cheating behavior. The replication, however, showed no such effect.

The co-author of the original paper, Jonathan Schooler, a psychologist at the University of California at Santa Barbara, said he still believes his original findings would hold up under specified conditions, but added, “Those conditions may be more narrowly specified than we originally appreciated.”

He has himself been an advocate for improving reproducibility, and said the new study shouldn’t tarnish the reputation of his field: “Psychology’s really leading the charge here in investigating the science of science.”

Nosek acknowledged that this new study is itself one that would be tricky to reproduce exactly, because there were subjective decisions made along the way and judgment calls about what, exactly, "reproduced" means. The very design of the review injected the possibility of bias, in that the volunteer scientists who conducted the replications were allowed to pick which experiments they wanted to do.

“At every phase of this process, decisions were made that might not be exactly the same kind of decision that another group would make,” Nosek said.

There are about 1.5 million scientific studies published a year, he said. This review looked at only 100 studies.

That’s a small sample size – another reason to be hesitant before declaring the discovery of a new truth.
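To put a rough number on that uncertainty, here is a back-of-the-envelope sketch. It treats the 100 studies as if they were a simple random sample, which, as noted above, they were not, since volunteers chose which experiments to replicate. Under that assumption it computes a 95 percent Wilson score interval for a success rate of 39 out of 100. The function name and the choice of the Wilson method are illustrative, not drawn from the paper.

```python
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion.

    Assumes independent trials drawn at random, an idealization that the
    replication project itself does not claim to satisfy.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Critical value of the standard normal for the requested confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = successes / trials
    denominator = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denominator
    half_width = (
        z * ((p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials**2)) ** 0.5)
    ) / denominator
    return center - half_width, center + half_width


if __name__ == "__main__":
    low, high = wilson_interval(39, 100)
    print(f"Plausible replication rate: {low:.0%} to {high:.0%}")
```

Even under that generous assumption, the interval spans close to 20 percentage points, which is one way to see why a single review of 100 studies cannot settle the question.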
