Computer algorithms are being used to determine the amount of bail or whether or not to grant parole. These programs are designed to be race-neutral, but some worry they are biased, too. Wonkblog's Max Ehrenfreund takes a look at the data. (Daron Taylor/The Washington Post)

Judges around the country are now relying on algorithms to help them decide whether defendants should be let free on bail, how long those convicted should be imprisoned and more. For many reformers, the machines promise to remove human racial biases from the courtroom.

Asking a computer to predict whether someone will commit a crime in the future raises troubling questions, however, as a recent report from the investigative outlet ProPublica shows. Often, the courts rely on proprietary algorithms that are designed by for-profit firms and concealed from the public. Some worry that the machines in the courtroom are as biased as the humans.

ProPublica's report was critical of an algorithm used in Fort Lauderdale, Fla., and the surrounding county. The proprietary algorithm that was the subject of the report, called Correctional Offender Management Profiling for Alternative Sanctions, or COMPAS, estimates the likelihood that a defendant will be arrested again, based on demographic data and their responses to a questionnaire. The defendant's race is not used in the calculation.

The journalists at ProPublica found that, among those defendants not arrested again within two years, the algorithm had classified the black defendants as higher risk at about twice the rate of white defendants: 45 percent compared to 23 percent.

Recently, though, ProPublica's study has received criticism from experts in the field, who say those statistics — while technically accurate — are misleading.

Three researchers released a draft paper last month that concludes there was no statistical evidence of racial bias in ProPublica's data. The algorithm "predicts recidivism in a very similar way" for both white and black defendants, the paper states. In other words, classifications for black defendants were no less accurate than scores for white defendants. Both groups received ratings that, on average, reflected their actual risk of recidivism.

The bias that ProPublica identified is not produced by the algorithm itself, the results suggest. Instead, those statistics could reflect broader racial disparities in the criminal justice system and society in general.

"Perhaps, what looks to be bias is not in the tool — it’s in the system," said Anthony Flores, a criminologist at California State University at Bakersfield and one of the authors of the paper.

For Flores, algorithms such as COMPAS have the potential to reduce one source of bias: the decisions judges make on bail and sentencing. Yet there are many sources of bias that contribute to the disproportionate rate of incarceration among African Americans, including decisions by prosecutors, police officers, teachers, businesses and banks.

The result is a statistical paradox. Systemic racial injustices can be reflected in software that holds the promise of greater equality.

How the software works

ProPublica contends that the software, developed by a firm called Northpointe, was "somewhat more accurate than a coin flip" when it came to forecasting crime, and that it produced "significant racial disparities." The independent investigative journalism outlet filed a public-records request for COMPAS data from the Broward County Sheriff's Office in Fort Lauderdale to determine whether the algorithm was accurate and unbiased.

In Broward County, judges take the algorithm into consideration when making decisions about whether to lock up defendants pending trial or to let them go on bail.

Northpointe's questionnaire includes 137 items about a defendant's life and family history, according to ProPublica. The defendant answers some of the questions, and other information is drawn from court records.

With this data, Northpointe's formula produces a score for each defendant on a scale from 1 to 10, with higher numbers indicating a greater likelihood that the defendant will be arrested again. For its analysis, ProPublica contrasted defendants rated "high risk" or "medium risk" — those with scores of 5 and above — with "low risk" defendants with lower scores.
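
The two-way grouping ProPublica used can be written down as a simple cutoff. The sketch below illustrates only that threshold; the function name `risk_group` is hypothetical, and nothing here reflects Northpointe's proprietary formula for producing the score itself.

```python
def risk_group(score: int) -> str:
    """Collapse a 1-10 risk score into the two groups used in ProPublica's analysis."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    # Scores of 5 and above cover the "medium risk" and "high risk" labels.
    return "higher risk" if score >= 5 else "lower risk"


print([risk_group(s) for s in (1, 4, 5, 10)])
```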

The algorithm produces these scores by comparing each defendant's profile to data from past cases. The scores are supposed to show how frequently other defendants who were in a similar position in life, broadly speaking, went on to commit additional crimes.

Consider two 25-year-old male defendants arrested on charges of driving under the influence. Neither has any history of violent crime. The first defendant has a bachelor's degree and a job. The second defendant is unemployed and dropped out of high school.

To the extent that a lack of employment and education has been associated with recidivism in past cases, an unbiased algorithm would predict that the second defendant's risk of recidivism is greater than the first defendant's.

Race is not part of the calculation, but many characteristics treated by the algorithm as indicators of potential recidivism are more common among black defendants — parental incarceration, frequent home relocations, living in a high-crime neighborhood and school suspensions, for instance. As a result, the algorithm scores more of them as being at higher risk.

What the data showed

That black defendants received worse scores is not in itself evidence of bias, however.

The racial disparity simply indicates that black defendants had more of the traits associated with recidivism in past cases. In the example above, if the second defendant is black, the difference in scores would not necessarily be evidence of racial bias. Instead, it could reflect an actual difference in risk for the two defendants, based on past experience.

These scores indicating greater risk reflected the fact that black defendants were more likely to be arrested again. It could be that the black defendants were more likely to commit new crimes because of broader social injustices — unequal access to education, employment and more. It also could be that white defendants and black defendants broke the law with equal frequency, but that the white defendants were more likely to get away with it.

The same critique applies to the figures specifically for those defendants who were not arrested again, which ProPublica cited as evidence of bias.

In this group, the algorithm had classified black defendants as higher risk at about twice the rate of white defendants. (The journalists found a similar discrepancy among those defendants who were arrested again. In this group, the algorithm had classified about 48 percent of white defendants as at a lower risk of recidivism, compared to 28 percent of black defendants.)

Yet these disparities are not necessarily evidence that the software is skewed, either. Instead of a bias in the algorithm, they could reflect a real difference in the risk of recidivism among black defendants in this group, based on past experience.

That is, among defendants who were not arrested again, the typical black defendant might have had to overcome longer odds than the typical white defendant to stay out of jail. The same inequities that likely explain why black defendants received worse scores in general could also account for the worse scores specifically among those who were not arrested again.

It might seem that defendants who are not arrested again should all receive similar scores on average, but consider the example of the 25-year-old male defendants above and suppose that neither has a new arrest. It would have been difficult to predict that outcome in advance for the second defendant, knowing that he was unemployed with no high school diploma. A racially unbiased observer likely would have concluded that he was more likely to be arrested again, based on his similarity to other defendants who were.

Why the software seems biased

Suppose the algorithm were systematically giving bad scores to black defendants who were not at a real risk of being arrested again. Many of them would not, in fact, be arrested again — and the actual rate of recidivism for black defendants who received bad scores would be lower than the rate for white defendants who received the same scores.

ProPublica's data, however, shows that white and black defendants who received the same score did pose similar real risks, as Flores and his colleagues noted.

Consider the defendants who received a 10 from the algorithm, the worst possible score. There were more than four times as many black defendants in this group as white defendants. Some of the black defendants did not merit this score, so to speak: About 21 percent were not arrested again within two years.

Yet that was even more true of the white defendants, 30 percent of whom were not arrested again. The algorithm is not systematically assigning more 10s to black defendants who are not at high risk.
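
The comparison described above amounts to grouping defendants by race and score, then checking how often each group was actually arrested again. A minimal sketch of that check follows. The function name and the records are invented for illustration (groups "A" and "B" are placeholders); they are not ProPublica's data or code.

```python
from collections import defaultdict


def rearrest_rate_by_score(records):
    """Return {(group, score): share re-arrested} for (group, score, rearrested) tuples."""
    totals = defaultdict(int)
    rearrests = defaultdict(int)
    for group, score, rearrested in records:
        totals[(group, score)] += 1
        rearrests[(group, score)] += int(rearrested)
    return {key: rearrests[key] / totals[key] for key in totals}


# Invented toy records, NOT real data: (group, score, re-arrested within two years?)
toy_records = [
    ("A", 10, True), ("A", 10, True), ("A", 10, True), ("A", 10, False),
    ("B", 10, True), ("B", 10, True), ("B", 10, True), ("B", 10, False),
    ("A", 2, False), ("A", 2, False), ("A", 2, True), ("A", 2, False),
    ("B", 2, False), ("B", 2, True), ("B", 2, False), ("B", 2, False),
]

rates = rearrest_rate_by_score(toy_records)
for key in sorted(rates):
    print(key, rates[key])
```

If a score meant the same thing for both groups, the rates for a given score would be close across groups, as they are in this toy data by construction. A score that was systematically too harsh on one group would show a lower re-arrest rate for that group at the same score.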

In total, among those classified as at higher risk of recidivism, a similar proportion were arrested again in both groups — 63 percent of black defendants and 59 percent of white defendants. Among those classified as at lower risk, 65 percent of black defendants and 71 percent of white defendants were not arrested again.

Yet since more black defendants were classified as higher risk in general, there was also a racial disparity in scores specifically among those who were not arrested again. These patterns — rather than a difference in the algorithm's performance for defendants of different races — apparently produced the discrepancy cited by ProPublica among those defendants not arrested again.
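
The arithmetic behind this point can be sketched in a few lines. The re-arrest and non-re-arrest percentages below are the ones quoted in this article; the share of each group classified as higher risk is not stated here, so the values 0.59 and 0.35 are assumptions chosen only for illustration, and the function name `error_rates` is hypothetical.

```python
def error_rates(share_high, ppv, npv):
    """Derive error rates among outcomes from the share flagged and the score's accuracy.

    share_high: fraction of the group classified as higher risk (assumed input)
    ppv: fraction of higher-risk defendants who were arrested again
    npv: fraction of lower-risk defendants who were not arrested again
    """
    high_not_rearrested = share_high * (1 - ppv)
    low_not_rearrested = (1 - share_high) * npv
    high_rearrested = share_high * ppv
    low_rearrested = (1 - share_high) * (1 - npv)
    # Share of defendants NOT arrested again who had been labeled higher risk.
    false_positive_rate = high_not_rearrested / (high_not_rearrested + low_not_rearrested)
    # Share of defendants arrested again who had been labeled lower risk.
    false_negative_rate = low_rearrested / (low_rearrested + high_rearrested)
    return false_positive_rate, false_negative_rate


# ppv and npv are quoted in the article; the share_high values are illustrative assumptions.
black_fpr, black_fnr = error_rates(share_high=0.59, ppv=0.63, npv=0.65)
white_fpr, white_fnr = error_rates(share_high=0.35, ppv=0.59, npv=0.71)
print(f"black: FPR={black_fpr:.2f}, FNR={black_fnr:.2f}")
print(f"white: FPR={white_fpr:.2f}, FNR={white_fnr:.2f}")
```

Under those assumed shares, the sketch lands close to the gaps ProPublica reported among defendants who were and were not arrested again, even though the score's accuracy within each risk class is similar for the two groups. The gap comes almost entirely from how many defendants in each group were classified as higher risk in the first place.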

One intuitive way of thinking about whether an algorithm is biased is to ask whether it would treat two defendants the same if they were similar in most ways except for their race. In other words, could a black defendant expect a better score from the algorithm if his race and nothing else about him changed?

These figures are evidence that if nothing about a defendant changed except his race, he would likely receive the same score — since the algorithm appears to be classifying defendants not by their race, but by personal traits that are correlated with recidivism.

(ProPublica sought to examine how the algorithm handled similar black and white defendants with a technique known as a logistic regression. The regression compared defendants who were similar in terms of age, gender, prior criminal history and the degree of the charge at the hearing and concluded that black defendants were more likely to receive higher-risk scores after adjusting for these factors. Yet this regression did not address any of the rest of the multitude of factors that could be associated with a person's tendency to commit crime.)

"The tool is a valid predictor," Flores said. "It's not biased."

Northpointe came to similar conclusions in an extensive rebuttal to ProPublica's article last month. "Their statistical analysis was inaccurate," said Jeffrey Harmon, the firm's general manager.

What's at stake

ProPublica's staff argues that the disparities cited in their article are problematic, even if they do not indicate predictive bias.

"If you look at predictive accuracy, in fact, the test is equal for blacks and whites," said Julia Angwin, one of the reporters who wrote the story. "That's, in our minds, not enough."

Defendants who go on to follow all the rules are wrongly impugned by the algorithm if they are classified as higher risk. Since the risk of recidivism is greater among black defendants, however, they suffer this harm disproportionately.

The pattern is a familiar one in debates about civil rights and the law. Policies that are designed and implemented in an unbiased way on an individual basis can have disparate effects when taking into account the divergent circumstances of minority populations.

In a context of inequity, Angwin pointed out, experts have no clear answer to the question of what makes an algorithm fair or unfair.

"This is a debate in the field. We have staked out a position," she said.

"A third of the time, somebody was misclassified," said Kevin Whiteacre, a criminologist at the University of Indianapolis. "If that misclassification results in freedoms taken away, then that has real implications."

Northpointe's algorithm "does in fact overclassify more blacks," Whiteacre added. "It underclassifies more whites, as a percentage, and that’s a big problem."

Jennifer Skeem, a psychologist at the University of California at Berkeley who has evaluated similar crime-prediction software for the state of California and the federal government, distinguished between the statistical question of bias and the moral one of fairness.

"Even a perfectly valid test might create concerns," Skeem said. "Even if the test is not biased, we still care about those differences in some sense, because they’re relevant to the moral issue."

Additionally, other ethical questions about the software's design cannot be answered with statistical techniques alone. There is ample room for debate about what it means for an algorithm to be fair — a question that is becoming all the more pressing in our increasingly quantified modern existence.

One example is the item on Northpointe's questionnaire about whether the defendant's parents were incarcerated. Even if this information makes the algorithm's predictions more accurate, defendants might find it unfair for a judge to make decisions about bail for them based on what their parents did.

"The real question should be, what is the data you’re allowed to use about someone?" said Cathy O'Neil, a private-sector mathematician who has concerns about Northpointe's product. "We urgently need to have this conversation."

What's kept secret

The thorny issues raised by software such as Northpointe's are a compelling reason to make the formulas public, or at least to subject them to rigorous, independent review, reformers say.

Last month, the Wisconsin Supreme Court ruled on a case challenging various aspects of Northpointe's algorithm, including its secrecy. The court sustained the use of the algorithm in criminal cases, but with a caveat: Along with any reports produced by the algorithm, judges must receive a note about the lack of transparency.

"The algorithm that’s used should be known and publicly available," said Cherise Fanno Burdeen, the executive director of the Pretrial Justice Institute. The Gaithersburg organization has helped several jurisdictions develop similar algorithms.

Harmon, the general manager at Northpointe, countered that the algorithms only produce information that a judge can consider alongside other arguments raised by the defense.

"These tools are not absolute," he said. "There's a human decision-maker that may take other factors into account."

Even imperfect machines might be an improvement over the humans in the courtroom. Judges' and prosecutors' mental processes are far more inscrutable than the most closely guarded proprietary algorithm. The biases that they bring to their work cannot be fixed with a few more lines of code.

The algorithms are "very scary. They’re very problematic in lots of different ways," said Phillip Atiba Goff, a psychologist at John Jay College of Criminal Justice in New York. "They’re also absolutely necessary."
