A while ago, I blogged about my Alabama Law Review article, Do Faith-Based Prisons Work?; here’s the fifth installment of the five-part series of blog posts, which has links to the previous four posts.

I’ve just discovered a new study — it’s from 2013, two years after my article appeared. It’s called Can Faith-Based Correctional Programs Work?, by Grant Duwe and Michelle King. Here’s the abstract (paragraph breaks added):

This study evaluated the effectiveness of the InnerChange Freedom Initiative (InnerChange), a faith-based prisoner reentry program, by examining recidivism outcomes among 732 offenders released from Minnesota prisons between 2003 and 2009.

Results from the Cox regression analyses revealed that participating in InnerChange significantly reduced reoffending (rearrest, reconviction, and new offense reincarceration), although it did not have a significant impact on reincarceration for a technical violation revocation. The findings further suggest that the beneficial recidivism outcomes for InnerChange participants may have been due, in part, to the continuum of mentoring support some offenders received in the institution and the community.

The results imply that faith-based correctional programs can reduce recidivism, but only if they apply evidence-based practices that focus on providing a behavioral intervention within a therapeutic community, addressing the criminogenic needs of participants and delivering a continuum of care from the institution to the community.

Given that InnerChange relies heavily on volunteers and program costs are privately funded, the program exacts no additional costs to the State of Minnesota. Yet, because InnerChange lowers recidivism, which includes reduced reincarceration and victimization costs, the program may be especially advantageous from a cost-benefit perspective.

How much faith should we put in this study? To know this, we have to know how the study addresses the self-selection problem. Programs are voluntary: no one participates unless they want to. And people who participate in a program might be people who have sufficient motivation to want to change — which is probably correlated to some degree with being a “better person”, in the sense of a person who’s less likely to reoffend later.

So if participants in a program turn out to have lower recidivism than non-participants, it could be because the program worked; or it could be because the program had zero effect but merely attracted a better sort of prisoner. (It could even be that the program had a negative effect and made people worse, but it attracted better prisoners to begin with, and those prisoners still remained slightly better after experiencing the negative effect of the program.)
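To make the worry concrete, here’s a toy simulation. Everything in it is an assumption chosen for illustration (the function name, the uniform "motivation" variable, and the probabilities 0.7 and 0.4); none of it comes from any study. Motivation is invisible to the evaluator, drives both the decision to join and the chance of reoffending, and the program’s true effect is set by hand.

```python
import random

def recidivism_rates(true_effect=0.0, n=20000, seed=0):
    """Return (participant rate, non-participant rate) of reoffending.

    true_effect is added to a participant's reoffense probability:
    0.0 means the program does nothing, a positive value means it hurts.
    """
    rng = random.Random(seed)
    participants, others = [], []
    for _ in range(n):
        motivation = rng.random()            # unobserved, between 0 and 1
        joins = rng.random() < motivation    # motivated inmates volunteer more often
        p_reoffend = 0.7 - 0.4 * motivation  # motivated inmates reoffend less
        if joins:
            p_reoffend += true_effect
        reoffends = rng.random() < p_reoffend
        (participants if joins else others).append(reoffends)
    return sum(participants) / len(participants), sum(others) / len(others)

# A program with no effect, and one that slightly raises reoffending.
print(recidivism_rates(true_effect=0.0))
print(recidivism_rates(true_effect=0.05))
```

In both runs the participants should come out with the lower recidivism rate, even though the program is useless in the first and mildly harmful in the second. The gap is produced entirely by who chose to sign up.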

How do we solve this problem? One way that people have tried is comparing participants against a matched group of non-participants. For every participant, you find a non-participant with identical race and gender, and matched as closely as possible on age, severity of offense, and other observable factors. But here’s the thing: you can only match on observable factors, because by definition a statistician evaluating the program doesn’t see a prisoner’s unobservables. And an important unobservable factor is the prisoner’s motivation to change. If you have two prisoners who look identical on observable factors, but one of whom participated in a program and the other of whom didn’t, why did they make different decisions as to participation? Perhaps because one had greater motivation. So you still haven’t solved the problem of self-selection. Your program might still have attracted better prisoners to begin with, so we can discount any positive effects you find.
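Here’s a minimal sketch of why matching on observables doesn’t rescue the comparison. The setup is hypothetical: a single observed variable (an offense "severity" tier), an unobserved motivation score, made-up probabilities, and a program whose true effect is exactly zero. Each participant is paired with a non-participant of identical severity.

```python
import random
from collections import defaultdict

def matched_gap(n=20000, seed=1):
    """Participant minus matched-control reoffense rate; true effect is zero."""
    rng = random.Random(seed)
    participants = []         # (severity, reoffended) for those who joined
    pool = defaultdict(list)  # severity -> outcomes of non-participants
    for _ in range(n):
        severity = rng.randrange(3)        # observed by the evaluator
        motivation = rng.random()          # not observed by the evaluator
        joins = rng.random() < motivation
        # Reoffending depends on severity and motivation, not on the program.
        reoffends = rng.random() < 0.5 + 0.1 * severity - 0.4 * motivation
        if joins:
            participants.append((severity, reoffends))
        else:
            pool[severity].append(reoffends)

    treated, control = [], []
    for severity, reoffends in participants:
        if pool[severity]:                        # skip if no match is left
            treated.append(reoffends)
            control.append(pool[severity].pop())  # one-to-one, without replacement
    return sum(treated) / len(treated) - sum(control) / len(control)

print(matched_gap())
```

The pairs are identical on everything the evaluator can see, yet the gap should still come out clearly negative, because the matched controls are systematically the less motivated inmates.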

A fancier statistical technique that some have tried is propensity score matching. This is exactly what Duwe and King do in this new study:

In this study, we evaluate the InnerChange program for male offenders that has operated in Minnesota’s prison system since 2002. We assess the effectiveness of InnerChange by comparing recidivism outcomes among 366 offenders who participated in the program and 366 offenders who were eligible but did not participate. The 732 offenders were released from Minnesota prisons between 2003 and 2009, and outcome data were collected through 2010, resulting in an average follow-up period of 3 years. To minimize observable selection bias, we used propensity score matching (PSM) to individually match the nonparticipants with those who entered InnerChange.

Duwe and King, being good researchers, carefully note the limitations of propensity score matching in their paper (paragraph breaks added):

In matching InnerChange participants with nonparticipants on the conditional probability of entering InnerChange, PSM reduces selection bias by creating counterfactual estimate of what would have happened to the InnerChange offenders had they not participated in the program.

PSM has several limitations, however, that are worth noting.

First, and foremost, because propensity scores are based on observed covariates, PSM is not robust against “hidden bias” from unmeasured variables that are associated with the assignment to treatment and the outcome variable. For example, given that InnerChange is a voluntary program, PSM would be unable to control for unobserved covariates arising from self-selection bias that have significant effects on selection to the program and recidivism.

Second, there must be substantial overlap among propensity scores between the two groups for PSM to be effective (Shadish, Cook, & Campbell, 2002); otherwise, the matching process will yield incomplete or inexact matches.

Finally, as Rubin (1997) points out, PSM tends to work best with large samples.

The authors are right; and it’s the first limitation that is the most problematic in the faith-based prison context. As the authors say: If you have “unobserved covariates” that affect both “selection to the program and recidivism”, you get a “hidden bias” to which “PSM is not robust”. The authors say that “an attempt was made to address potential concerns over unobserved bias by including as many theoretically relevant covariates (27) as possible in the propensity score model,” but this still doesn’t address the concern as long as the important motivation variable isn’t adequately reflected by those 27 extra variables.
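In standard potential-outcomes notation, the point can be written compactly. The symbols are my own shorthand, not the authors’: $T$ is participation, $X$ the observed covariates, $U$ unobserved motivation, and $Y(0), Y(1)$ the recidivism outcomes an inmate would have without and with the program.

```latex
% Propensity score and the assumption PSM relies on
e(X) = \Pr(T = 1 \mid X), \qquad
\bigl(Y(0), Y(1)\bigr) \perp\!\!\!\perp T \mid X

% Hidden bias: independence holds only once U is also held fixed
\bigl(Y(0), Y(1)\bigr) \perp\!\!\!\perp T \mid (X, U)
\quad\text{but}\quad
\bigl(Y(0), Y(1)\bigr) \not\perp\!\!\!\perp T \mid X
```

Adding more covariates to $X$ helps only to the extent that they stand in for $U$.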

Similarly, my paper explains why propensity score matching still doesn’t solve the self-selection problem, since the propensity score still doesn’t get at the unobservable factors that make some inmates participate and others not:

In propensity score matching, the researchers first identify the observable variables that best predict whether someone will participate in the program. This first-stage estimation generates a “propensity score” for each inmate; this is essentially an estimated probability of participating in the program. One inmate may have participated and the other may have not, but they may both have propensity scores of, say, 70%, so that they are estimated to be equally likely, ex ante, to have chosen to participate.

The matching process then matches each participant to a non-participant with a similar propensity score; a 70% propensity participating inmate is matched with a 70% propensity non-participating inmate, even if these inmates may differ on various individual characteristics.

Practitioners of propensity score matching point to certain advantages of the method over trying to match on observable variables directly. Given a participant with particular observable characteristics, it is often hard or impossible to find a non-participant with identical, or nearly identical, values of those same variables; by contrast, it is easier to match according to a single number.

But propensity score matching can’t overcome the problems of selection bias in the case of faith-based prisons. To see this, suppose that there were so many non-participating prisoners that exact matching on observables was always possible; every participating inmate would be matched with a non-participant who looked exactly identical. Because these two inmates would have identical observable characteristics, they would also have identical propensity scores. Matching on propensity scores would then produce exactly the same control group as the previous set of studies, which matched on observables directly.

Thus, if the direct matching studies weren’t credible, the propensity score matching studies aren’t credible either. Using the propensity score may improve the efficacy of matching, but it doesn’t alleviate the self-selection problem.

More technically, the problem is that propensity score methods give the correct result only if nonobservables play no role in the selection mechanism, or more precisely, if the unobserved determinants of participation play no role in ultimate success (that is, low recidivism). This assumption is quite false in the case of faith-based prison programs, where motivation to change, and possibly religiosity itself, both determine participation in the program and play a large role in whether an inmate reoffends. James Heckman and Richard Robb argue that “[t]he propensity score methodology solves a very special problem . . . that is of limited interest to social science data analysts.” Whether Heckman and Robb are right about the interest of propensity score studies in general, faith-based prison evaluation certainly seems like one area where the method isn’t credible.
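A toy sketch of that technical point follows. It is not taken from my paper or from Duwe and King. The function `psm_gap`, the single observed "severity" covariate, the hidden motivation variable, and all the probabilities are assumptions for illustration. The propensity score is estimated very simply, as the participation rate among inmates with the same observed covariate, and the program’s true effect is zero in both scenarios.

```python
import random
from collections import defaultdict

def psm_gap(hidden_selection, n=40000, seed=2):
    """Matched reoffense gap (participants minus controls); true effect is zero."""
    rng = random.Random(seed)
    rows = []  # (severity, joined, reoffended)
    for _ in range(n):
        severity = rng.randrange(3)  # observed covariate
        motivation = rng.random()    # unobserved by the evaluator
        # Selection runs on hidden motivation, or on the observed covariate only.
        p_join = motivation if hidden_selection else 0.2 + 0.2 * severity
        joined = rng.random() < p_join
        reoffended = rng.random() < 0.5 + 0.1 * severity - 0.4 * motivation
        rows.append((severity, joined, reoffended))

    # First stage: propensity score = participation rate within each
    # observed covariate cell.
    joined_count, total = defaultdict(int), defaultdict(int)
    for severity, joined, _ in rows:
        joined_count[severity] += joined
        total[severity] += 1
    score = {s: joined_count[s] / total[s] for s in total}

    # Second stage: match each participant to a non-participant whose
    # score is nearest, drawing controls with replacement.
    controls = defaultdict(list)  # score -> outcomes of non-participants
    for severity, joined, reoffended in rows:
        if not joined:
            controls[score[severity]].append(reoffended)
    treated_out, control_out = [], []
    for severity, joined, reoffended in rows:
        if joined:
            nearest = min(controls, key=lambda c: abs(c - score[severity]))
            treated_out.append(reoffended)
            control_out.append(rng.choice(controls[nearest]))
    return sum(treated_out) / len(treated_out) - sum(control_out) / len(control_out)

print(psm_gap(hidden_selection=False))  # selection on observables only
print(psm_gap(hidden_selection=True))   # selection on hidden motivation
```

When participation depends only on the observed covariate, the matched gap should land near zero, which is the right answer. When participation depends on hidden motivation, the same procedure should report a sizable "benefit" from a program that does nothing.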

So, while I welcome further research into faith-based prisons, I don’t think results that derive from propensity-score matching are highly credible.