Should an algorithm help make decisions about whom to release before trial, whom to release from prison on parole or who receives rehabilitative services? Such algorithms are already informing criminal justice decisions across the United States and around the world, and they have become the subject of heated public debate. Many of them rely on patterns in historical data to assess each person’s risk of missing their next court hearing or being convicted of a new offense.

More than 60 years of research suggests that statistical algorithms are better than unaided human judgment at predicting such outcomes. In 2018, that body of research was questioned by a high-profile study published in the journal Science Advances, which found that humans and algorithms were about equally good at assessing who will reoffend. But when we attempted to replicate and extend that study, we found something different: Algorithms were substantially better than humans under conditions that approximate real-world criminal justice proceedings.

Some researchers say that untrained people can match the performance of algorithms

In the 2018 study, Dartmouth College researchers recruited 400 people online to read 50 short descriptions of real defendants and asked participants whether they thought each person would be arrested within two years. These untrained recruits correctly predicted outcomes in 62 percent of cases, while COMPAS, one widely used risk-assessment algorithm, was about 65 percent accurate.

The authors concluded that COMPAS “is no more accurate … than predictions made by people with little or no criminal justice expertise” and argued that their results “cast significant doubt on the entire effort of algorithmic recidivism prediction.” In plain language, they argued that algorithms are no better than people at predicting which defendants will reoffend.

That message attracted considerable media attention and was taken up by those worried about using algorithms to inform decisions about the fate of people’s lives.

Surprised by the finding, we redid and extended the Dartmouth study with about 600 participants similarly recruited online. This past month, we published our results.

The Dartmouth findings do not hold in settings that are closer to real criminal justice situations

The problem isn’t that the Dartmouth study’s specific results are wrong. We got very similar results when we reran the study by asking our own participants to read and rate the same defendant descriptions that the Dartmouth researchers used. It’s that their results are limited to a narrow context. We repeated the experiment by asking our participants to read descriptions of several new sets of defendants and found that algorithms outperformed people in every case. For example, in one instance, algorithms correctly predicted which people would reoffend 71 percent of the time, while untrained recruits predicted correctly only 59 percent of the time — a 12 percentage point gap in accuracy.

This gap increased even further when we made the experiment closer to real-world conditions. After each question, the Dartmouth researchers told participants whether their prediction was correct — so we did that, too, in our initial experiments. As a result, those participants were able to immediately learn from their mistakes. But in real life, it can take months or years before criminal justice professionals discover which people have reoffended. So we redid our experiment several more times without this feedback. We found that the gap in accuracy between humans and algorithms doubled, from 12 to 24 percentage points. In other words, the gap increased when the experiment was more like what happens in the real world. In fact, in this case, where immediate feedback was no longer provided, our participants correctly rated only 47 percent of the vignettes they read — worse than simply flipping a coin.

Humans are worse at assessing risk than algorithms

Why was human performance so poor? Our participants significantly overestimated risk, believing that people would reoffend much more often than they actually did. In one iteration of our experiment, we explicitly and repeatedly told participants that only 29 percent of the people they were assessing ultimately reoffended, but our recruits still predicted that 48 percent would do so. In a courtroom, these “judges” might have incorrectly flagged many people as high risk who statistically posed little danger to public safety.
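Overprediction on this scale is enough, on its own, to drag accuracy toward chance. A back-of-the-envelope sketch in Python (our own illustration, not a calculation from the study) shows why: if the true reoffense rate is 29 percent and a rater flags 48 percent of people essentially at random, the expected accuracy is barely better than a coin flip.

```python
def expected_accuracy(flag_rate: float, base_rate: float) -> float:
    """Expected accuracy of a random prediction: the chance of a correct
    'will reoffend' call plus the chance of a correct 'will not' call."""
    return flag_rate * base_rate + (1 - flag_rate) * (1 - base_rate)

# Base rate of reoffending told to participants: 29 percent.
# Rate at which participants actually predicted reoffending: 48 percent.
print(round(expected_accuracy(0.48, 0.29), 3))  # ~0.508, barely above 0.5
```

Flagging at the true 29 percent rate instead would raise the expected accuracy of even random guessing, which is one way to see how much the overestimation itself costs.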

Humans were also worse than algorithms at exploiting additional information — something that criminal justice officials have in abundance. In yet another version of our experiment, we gave humans and algorithms detailed vignettes that included more than the five pieces of information provided about a defendant in the original Dartmouth study. The algorithms that had this additional information performed better than those that did not, but human performance did not improve.

Algorithms show promise but have limitations

Our results indicate that statistical algorithms can indeed outperform human predictions of whether people will commit new crimes. These findings are consistent with an extensive literature, including field studies, showing that algorithmic predictions are more accurate than those of the unaided judges and correctional officers who make life-changing decisions every day.

Of course, policymakers may decide that risk simply should not factor into some legal decisions. Earlier this year, New York City adopted a groundbreaking policy of abolishing bail for nearly all defendants charged with nonviolent felonies, regardless of their risk of reoffending or failing to appear at court. Further, it’s important to ensure that algorithms don’t simply reflect bias in the data on which they are built, a concern we have examined in our broader research efforts. When, whether and how to consider algorithms, however accurate, remain larger issues of policy and ethics — and those decisions are firmly in human hands.

Zhiyuan “Jerry” Lin is a PhD candidate in computer science at Stanford University.

Jongbin Jung is a data scientist with a PhD from Stanford University.

Sharad Goel is an assistant professor at Stanford University and executive director of the Stanford Computational Policy Lab.

Jennifer Skeem is a professor of social welfare and public policy at the University of California at Berkeley and directs Berkeley’s Risk-Resilience Lab.