This continues yesterday’s post about the effectiveness of faith-based prisons, based on my recent Alabama Law Review article, Do Faith-Based Prisons Work? (Douglas Berman’s Sentencing Law and Policy Blog called this a “must read”; see also this discussion on Dru Stevenson’s Privatization Blog. This article is a companion to Prison Vouchers and The Constitutional Possibilities of Prison Vouchers, though the ideas here are entirely independent of the vouchers idea.)

After yesterday’s introduction to the topic, today I’ll talk about how the self-selection problem makes any comparison of faith-based programs with regular programs problematic. I’ll illustrate with some of the weakest studies, which show the self-selection problem in its most naked form. I’ll then turn to some of the better studies, which control for certain important variables, but I’ll explain why even those are inadequate to solve the self-selection problem.

Perhaps most interestingly, at each stage, I’ll compare the faith-based prison studies with the private school studies. Education researchers have been trying to compare private and public schools for several decades, but for a long time these studies have been plagued with the same sort of selection problems present in the faith-based prison literature.

*     *     *

The most serious problem with studies of the effectiveness of faith-based prisons is the self-selection problem. Prisoners obviously select into faith-based prisons voluntarily. And the factors that would make an inmate select a faith-based prison may also make him less likely to commit crimes in the future. One such factor might be religiosity itself. In addition, an inmate who takes the trouble to choose to join a rehabilitative program may be more motivated and more open to change, and this may itself make him more likely to change—regardless of whether the program actually “works.”

The following Parts illustrate three types of studies that don’t adequately control for self-selection, both for faith-based prisons and for the analogous context of private/Catholic schools.

The first type of study shows the self-selection problem in its most naked form: it simply compares the results of participants in a faith-based program with those of non-participants.

The second type of study accounts for some of the differences between participants and non-participants by comparing the group of participants with a matched group of non-participants, where the matching is based on various observable factors like race, age, criminal history, and the like. But, of course, such a procedure can’t control for unobservable variables, like motivation to change.

The third type of study (which I’ll cover in tomorrow’s post) uses a more sophisticated statistical technique called “propensity score” matching. Participants are matched to non-participants not based on observable factors directly, but based on their propensity score, that is, their estimated probability of participating in the program. While propensity scores are a useful technique in some applications, they don’t alleviate the self-selection problem in the faith-based prison context.
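For readers unfamiliar with the technique, here is a minimal sketch of what propensity-score matching involves, on simulated data with two hypothetical covariates. This is my illustration, not any study’s actual procedure, and the variable names and numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Hypothetical observable covariates: age and number of prior offenses.
age = rng.normal(35.0, 10.0, n)
priors = rng.poisson(2.0, n).astype(float)

# Participation depends on the observables (and, in any real prison,
# also on unobservables like motivation -- which is exactly the problem).
p_join = 1.0 / (1.0 + np.exp(-(-1.5 + 0.02 * age - 0.1 * priors)))
joined = rng.random(n) < p_join

# Step 1: estimate each inmate's propensity score -- the predicted
# probability of joining the program, given the observables.
X = np.column_stack([age, priors])
pscore = LogisticRegression().fit(X, joined).predict_proba(X)[:, 1]

# Step 2: match each participant to the non-participant with the
# nearest propensity score (nearest-neighbor matching with replacement).
treated = np.flatnonzero(joined)
controls = np.flatnonzero(~joined)
gaps = np.abs(pscore[treated][:, None] - pscore[controls][None, :])
matched = controls[np.argmin(gaps, axis=1)]

# Matched pairs now look alike on *observables*; any unobserved
# difference (e.g., motivation) survives the matching untouched.
print(np.mean(np.abs(pscore[treated] - pscore[matched])))
```

As the final comment notes, the matching guarantees similarity only on what was measured, which is why the technique can’t cure self-selection here.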

A. Naked Self-Selection

The studies in this section purport to find a positive effect of faith-based prisons based on comparing, say, recidivism rates of participants in faith-based units and prisoners in the general population or in different prisons. But these sorts of studies aren’t credible because they make no effort to control for self-selection. “Without knowledge of the selection process, there is no way to determine whether observed differences between program participants and ‘comparisons’ are due to actual program effects or are an artifact of preexisting differences between the groups.” Rather than giving us the effect of faith-based prisons, these studies may be giving us the effect of faith-based prisoners.

1. Johnson’s Brazil Study

Byron Johnson compared recidivism among inmates in two Brazilian prisons: Humaitá, a faith-based facility, and Bragança, a secular facility with vocational training programs. Data wasn’t available for 46% of the inmates, though the data loss didn’t differ significantly between the two prisons. High-risk Humaitá inmates had significantly lower recidivism—12% of the high-risk Humaitá inmates were re-arrested after three years, versus 38% of the high-risk Bragança inmates. The average number of re-arrests was also significantly lower for Humaitá prisoners, even though the Humaitá prisoners’ original offenses had on average been more serious and more likely to be violent, and the prisoners had possibly served more time in prison.

The main problem with this study is that prisoners apply to be in Humaitá, prisoners’ families must be “involved in the prisoner’s recuperation process,” prisoners aren’t accepted without sufficient “motivation and commitment to change,” and prisoners don’t stay unless they and the prison agree after an initial 60-day assessment period. The results are thus tainted by multiple sources of bias: self-selection, selection by the prison itself, and success through the assessment period.

Moreover, among low-risk inmates, recidivism rates weren’t significantly different between the two prisons. There was no significant difference in time to re-arrest or in the severity of the subsequent offense. The reincarceration rate was lower among Humaitá inmates, but “the validity of this finding is questionable due to extensive data loss.” Moreover, many relevant background factors, like age or criminal history, weren’t considered, perhaps because the data wasn’t available.

Finally, Humaitá differs from other Brazilian prisons (possibly including Bragança) in many ways unrelated to religion. The environment is more pleasant, prisoners and their families are treated better, there are more (non-religious) activities, and so on. Any improvements in recidivism could therefore have been caused not only by selection, but also by better secular prison conditions.

2. O’Connor et al.’s Theology Study

Thomas O’Connor and his coauthors compared recidivism between 54 inmates who participated in a master’s program in theology at Sing Sing prison and 402 non-participants. Completion of the ministry program was associated with a significantly lower risk of re-arrest in the first 28 months out of prison—only 9% of participants were re-arrested, compared to 37% of non-participants.

However, both self-selection and selection by program administrators taint these results. The students were selected by “a highly competitive application and reference process”; the program was open only to inmates with a college degree, who read and wrote well, and who had “references from chaplains and other inmates attesting to their religious commitment” and showed “a deep willingness to turn their lives around.” In fact, according to the president of the seminary that ran the theology program, the program had “built-in success” because they made sure to accept applicants “who want to learn who they are, what they value and what they believe in.”

3. Kerley et al.’s Religiosity Study

Kent Kerley and his coauthors examined the relationship between religiosity and negative prison behaviors at the Mississippi State Penitentiary in Parchman, Mississippi. First, they measured inmates’ religiosity using a survey. Most of these measures are irrelevant for our purposes because they don’t involve specific programming—for instance, inmates were asked whether they had experienced a conversion and whether they believed in God. But one of the measures was attendance at a one-day Prison Fellowship Ministries event called Operation Starting Line, “which included Christian musicians, comedians, professional athletes, and other speakers,” and which was held about six months before the survey.

Participation in Operation Starting Line predicted a significantly reduced rate of arguing with other inmates—52.5% of participants argued with other inmates once or more per month, as opposed to 60.0% of non-participants. But participants and non-participants didn’t differ statistically significantly in their likelihood of fighting once or more per month—18.9% for participants versus 19.3% for non-participants.

Inmates, of course, self-selected into the Starting Line events. In addition, the data came from a voluntary survey distributed to inmates, in which both religiosity and negative behaviors were self-reported, and the response rate was only 45%.

4. The Florida DOC’s Kairos Horizons Study

The Florida Department of Corrections, which ran a faith-based dorm, Kairos Horizons, at its Tomoka Correctional Institution, performed an unpublished study of the effectiveness of the program. To be eligible for the dorm, an inmate had to have had no disciplinary reports in the previous six months. The 59 inmates who spent the entire six-month program at the faith-based dorm were compared to 8 inmates who didn’t complete the six months, 741 inmates at Tomoka who didn’t participate at all, and 54,997 inmates at other Florida prisons. (The comparison groups were also limited to inmates without disciplinary reports in the previous six months.)

Inmates who completed the six-month program had lower rates of disciplinary reports than non-participants or inmates at other Florida prisons; about 5% of completers received disciplinary reports, compared to 37.5% of non-completers, 17% of non-participants, and 12% of inmates at other prisons. If—to see the effect of participation rather than the effect of program completion—we lump non-completers and completers together, the rate becomes about 9%, which isn’t significantly different from the rate among non-participants at Tomoka or at other prisons.
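To see why the roughly 9% participant rate isn’t statistically distinguishable from the 17% non-participant rate, here is a rough two-proportion z-test using counts reconstructed from the reported percentages (this is my own back-of-the-envelope calculation, not the study’s test, and the counts are approximations):

```python
from math import sqrt, erf

def two_prop_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Approximate counts reconstructed from the reported percentages:
# about 6 of 67 participants (~9%) vs. 126 of 741 non-participants (~17%).
z, p = two_prop_ztest(6, 67, 126, 741)
print(round(z, 2), round(p, 3))  # z ≈ -1.71, p ≈ 0.09: not significant at 0.05
```

With only 67 participants, even an eight-point difference in rates falls short of conventional significance.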

A similar faith-based program in England also reports greater disciplinary improvement among program participants.

5. Denny’s Kairos Horizon Study

Dan Denny analyzed in-prison misconduct and post-release recidivism rates for participants in a Kairos Horizon program at the Davis Correctional Facility, a private, medium-security prison in Oklahoma.

Denny examined three cohorts of participants, from “Year One” (2002), “Year Two” (2003), and “Year Three” (2004). The 36 Year One participants had 89% fewer misconduct reports after the program than before; the drop for the 51 Year Two participants was 80%; and the drop for the 51 Year Three participants was 84%. The average drop was 86%. Misconduct reports in the entire facility fell from 901 to 308 (a 66% drop) from Year One to Year Three, which is presumably comparable to the 80% before-to-after drop for the Year Two participants. It’s unclear from the paper how many inmates there were at the facility during this time, so it’s unclear whether the drop in misconduct among program participants is significantly different from the total decrease facility-wide.

When the paper was written, only seven participants had been released, the longest-released graduate had only been out for one year, and no graduate had been re-arrested. So the author couldn’t report “true recidivism rates” by Oklahoma standards, which require a three-year post-release history.

6. Education Studies

Some education studies also use this approach, neither addressing self-selection nor controlling for observable variables.

One example is Janet Beales and Maureen Wahl’s assessment of the Partners Advancing Values in Education (PAVE) program in Milwaukee, a privately funded voucher system that functioned parallel to the publicly funded voucher system, the Milwaukee Parental Choice Program (MPCP). (A similar paper available online is here.) Beales and Wahl found that 63.2% of PAVE students scored above the 50th percentile in reading (60.4% in math), which was much higher than the corresponding percentages for MPCP students, Milwaukee public school low-income students, or all Milwaukee public school students. (These percentages were all between 16% and 35%.) PAVE students were similarly above the three comparison groups in reading and math test score medians and means.

However, the PAVE group differed from the other groups in various ways. Most obviously, the PAVE group, like the MPCP group, was self-selected, since one had to apply for a voucher; the public school students weren’t self-selected. But the PAVE group and the MPCP group weren’t comparable either: the PAVE scores were the test results of seventh-grade students, while the MPCP scores were test results from multiple grade levels, so the authors weren’t even comparing the same test. Finally, the authors couldn’t control for income, parental education, or other variables.

B. Studies with Some Controls

The studies in the previous section aren’t credible because participants in religious programs are just so different from non-participants. One possible fix would be to control for observable differences between participants and non-participants. This is what the studies reported in this section do: participants are matched with non-participants whose observable characteristics are as similar as possible.

But these studies are still vulnerable. An unobserved variable—motivation to change—affects both whether the inmate participates and whether he reoffends. Because motivation and success (avoiding re-arrest) are positively correlated, any effect we find is probably biased upward (ignoring any other sources of bias in one direction or another). A true zero effect may look like a positive effect because we’re measuring the effect of motivation.

In other words, if two prisoners are perfectly matched on the observables, but one of them chose to participate and the other didn’t, these two prisoners aren’t really well matched. Any study that finds better results among participants is thus still subject to self-selection bias.
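A toy simulation makes the point concrete. In the following sketch (all numbers invented), the program has literally zero causal effect, yet a naive comparison of participants with non-participants shows a sizable “benefit,” because unobserved motivation drives both participation and desistance:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Unobserved motivation to change (recorded in no data set).
motivation = rng.normal(0.0, 1.0, n)

# Motivated inmates are more likely to join the program...
joins = rng.random(n) < 1.0 / (1.0 + np.exp(-motivation))

# ...and less likely to reoffend -- with the program having NO causal
# effect at all (note that `joins` appears nowhere in this formula).
reoffends = rng.random(n) < 1.0 / (1.0 + np.exp(-(-0.5 - 0.8 * motivation)))

print(f"participants:     {reoffends[joins].mean():.1%}")
print(f"non-participants: {reoffends[~joins].mean():.1%}")
# The participant rate comes out substantially lower -- pure selection.
```

No matching on observables can fix this, because `motivation` is, by construction, never observed.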

1. La Vigne et al.’s Florida Study

Nancy La Vigne and her coauthors reported on six- and twelve-month recidivism rates of participants in two Florida “faith- and character-based institutions” (FCBI)—one male (Lawtey) and one female (Hillsborough). Participants were matched with a control group based on “sex, age, race, primary offense type, violent/non-violent offense, number of prior incarcerations, time incarcerated for current offense, time to expected release, and pre-study disciplinary report rate.”

At six months, male FCBI participants had lower recidivism rates than their control group: none of the 189 male inmates from Lawtey were reincarcerated after six months, compared to four of the 189 male comparison inmates (2.1%). There was no significant difference for females, and twelve months out, there was no significant effect at all for either males or females. There was also no significant difference between average time to reincarceration for the faith-based inmates and the comparison inmates, for either males or females. The results here are thus extremely weak.

A later report by Diana Brazzell and Nancy La Vigne, using new data, continued to find “no statistically significant difference . . . in the proportion of FCBI and non-FCBI inmates returned to prison within 12, 18, 24, and 26 months of release,” for either males or females.

2. Rose’s Kainos Community Study

Gerry Rose evaluated the effect on reconviction of participation in the Kainos Community, a faith-based prison chiefly operating out of The Verne prison in England. The 84 participants were compared against a sample of 13,832 prisoners; the comparison sample was composed of all adult sentenced prisoners released from prisons in England and Wales in 1996 and 1997 who were British nationals, had served sentences of six months to 15 years, had been released from particular categories of prisons, and satisfied a few additional restrictions. In the Kainos sample, 22.6% of the participants were reconvicted within a year of release; among non-participants, the percentage was 25.9%. This difference wasn’t significant.

So far, this didn’t control for any variables. But Rose then went further, comparing the actual reconviction rates of Kainos participants with their own predicted reconviction rates. The predicted rates were based on a statistical model that controlled for observable factors such as their sex, offense category, age at first conviction, age at sentence, months spent in prison after sentence, and number of custodial sentences before age 21. Thus, rather than comparing participants and non-participants, he compared actual participants with hypothetical participants whose recidivism was predicted based on factors that didn’t include their participation in a faith-based program.

There, too, Rose found no significant effect: 25.0% of the Kainos sample was reconvicted, while the expected percentage would have been 26.0% or 24.2% (depending on which prediction model one used).

3. Young et al.’s Prison Ministry Study

Mark Young and his coauthors investigated “long-term recidivism among . . . federal inmates trained as volunteer prison ministers” as part of Prison Fellowship Ministries’ Washington D.C. Discipleship Seminars. Participants were sent to Washington for a two-week faith and leadership seminar, and their recidivism was compared to that of a control group. The control group was selected to match the experimental group with respect to race, gender, age at release, and the “salient factor score” (an estimate of a prisoner’s likelihood of recidivism).

Participants’ recidivism rate was 40%, while the control group’s recidivism rate was 51%. Participating women had a recidivism rate of 19%, compared to 47% for the control women, and participating men had a recidivism rate of 45%, compared to 52% for the control men. When the groups were further broken down by gender and race, participants had lower recidivism rates for all subgroups except black men.

As in the theology study above, these results are subject to both self-selection and selection by program administrators, in this case prison chaplains, who chose which inmates could participate.

4. O’Connor et al.’s Lieber Prison Study

Tom O’Connor and his coauthors reported on rates of in-prison infractions among participants in Prison Fellowship (PF) programming at Lieber Prison in South Carolina. Their data set of 1,597 included both participants and non-participants; 302 inmates attended at least one out of 47 Prison Fellowship meetings.

Participants had lower infraction rates than non-participants: “9.9% of PF inmates had an infraction since attending at least one PF program compared to the 23.2% of Non PF inmates who had an infraction.” The more an inmate participated in PF programs, the lower his chance of having an infraction.

Controlling for prior violent convictions, age, marital status, and days spent in the prison, participation in PF programs strongly predicted lower infraction rates. “Non PF inmates were still 2.5 times more likely than PF inmates to have an infraction.”

The rate of participation in PF programs, controlling for the same variables, likewise strongly predicted lower infraction rates. But controlling for the rate of participation isn’t useful. Given a valid control group, the only valid comparison is between the control group and the entire treatment group. If we compare the control group to isolated, self-selected subsets of the treatment group, like those who participated the most in PF programs, we are merely reintroducing another layer of self-selection bias. Even if high participation reduces infraction rates (which is doubtful, given that the high participants may already be better people), the relevant question from a policy perspective, that is, from the perspective of someone wondering whether to introduce the program, is how well it works overall, including for those who choose not to participate much.

5. Wilson et al.’s COSA Study

Robin Wilson and coauthors examined the effect on recidivism of the Circles of Support and Accountability (COSA) program in south-central Ontario. Unlike the programs discussed so far, COSA isn’t an in-prison program; rather, it’s a support network, largely staffed by religious volunteers, to support the reintegration of released sex offenders into society. A group of 60 sex offenders assigned to COSA were compared against a group of non-participants who were similarly detained, had similar recidivism risk categories, were released around the same time, and had similar “prior involvement in sexual offender treatment programming.” The COSA group had significantly lower recidivism: a 5% rate of sexual recidivism and a 15% rate of violent recidivism, compared to 17% and 35% among the comparison group.

Robin Wilson and coauthors found similar results in a follow-up study of COSA participants across Canada. There, too, the 44 COSA participants from assorted Canadian cities were matched, on similar control variables, to a comparison group of sexual offenders who didn’t participate. The COSA group had lower rates of sexual recidivism (2.27%), violent recidivism (9.09%), and overall recidivism (11.36%) than the control group (13.67%, 34.09%, and 38.64%, respectively).

6. Self-Selection in Prisons and Schools

As I’ve pointed out above, self-selection also plagues studies of the effectiveness of private schools.

Early work by James Coleman and his coauthors estimated the effect of private schooling on sophomore scores, controlling for various background characteristics. Coleman et al. recognized that selection was a potentially serious problem, but noted that it was impossible to properly solve the problem “in the absence of random assignment to treatments, or something approximating it,” and that one had to proceed regardless.

Other studies found a weaker effect. Jay Noell, Doug Willms, Karl Alexander and Aaron Pallas, and William Morgan analyzed the same data with different specifications and different control variables and found a much weaker effect of private schools. John F. Witte and his coauthors found that students in the Milwaukee voucher program didn’t “differ in any predictable way on achievement tests” from Milwaukee public-school students over the first four years of the program. And, in a recent study, Harold Wenglinsky similarly controlled for various observable variables and followed students over time, and found no positive effect for private schools.

Various studies found effects that differed according to the precise outcome variable or the precise population being studied. Cecilia Rouse, comparing Milwaukee voucher students with Milwaukee public school students, found a substantial effect on math scores, but no effect on reading scores, of being selected to attend a voucher school in Milwaukee. Jeffrey Grogger and Derek Neal found significant effects on high-school graduation rates, college attendance rates, and math test scores. Gains for urban minorities were especially large, but there was “little evidence of math-achievement gains for suburban minorities in Catholic schools.”

Private-school researchers have also investigated whether the public versus private choice affects the growth of test scores from the sophomore to the senior year. Coleman and his coauthors did this by comparing two different cohorts—a sophomore class and a senior class in the same year. Later, John Chubb and Terry Moe, as well as Douglas Willms and Karl Alexander and Aaron Pallas, who had the benefit of follow-up data, compared the sophomore and senior scores of the same students. But these methods also don’t control for selection bias if one believes (as is plausible, and as Coleman et al. agree) that selectivity affects growth rates in addition to levels.

It should be clear that prison and education studies share common methodological problems. We can discount any positive results of these studies as being potentially artifacts of self-selection. But what about the studies that found no effect—for instance, in the faith-based prison case, the La Vigne studies, and the Rose study? Surely, if positive results are overstated by some unknown amount, zero results must prove that faith-based prisons don’t work at all, and that the true effect is, if anything, negative?

This is tempting, but we should resist this conclusion for the following reasons:

  • The self-selection bias overstates results, but there may be other empirical problems that tend to understate results. For instance, there may be other unobserved variables that are negatively correlated with success. (Perhaps people also tend to participate in programs if they feel they need it more? Perhaps programs that provide additional resources to inmates and that are selective also attract inmates who are good at lying to the program administrators about their suitability for the program? Perhaps, if participation in a program contributes to parole decisions, the program attracts problem inmates who are more likely to need the good points on their record? Generally, there is always a problem with insincere inmates who take advantage of religious programs to “gain protection,” “meet other inmates,” “interact with volunteers,” and “gain access to prison resources,” quite apart from any desire to reform.) Or there may be measurement error in the dependent variable (i.e., some of the inmates who are re-arrested are wrongly coded as not having been re-arrested and vice versa), which tends to reduce the measured effect. So just as a positive measured effect could hide a true zero effect, a zero measured effect could hide a true positive effect.
  • Every program is different, and some programs may have a zero measured effect only because they were badly designed or badly run. Their failure needn’t reflect badly on other programs that are done well. In fact, even if only a handful of programs “work,” the whole process of experimentation can be counted a success, provided those programs, once shown to work, can be replicated.
  • Alexander and Pallas noted that the effect of private schools appeared much smaller when the follow-up data was analyzed and students’ previous test scores were used as controls for their current performance. This dramatic change from a background-controls-only specification to a background-controls-and-test-scores specification, they argued, showed “that background proxies are simply inadequate when attempting to assess the impact of school organization on cognitive outcomes.” This is a modest moral of the “background proxy” studies: when one’s empirical method is subject to an important source of bias, the precise specification can have a large effect on the results.
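The measurement-error point from the first bullet can also be checked with a toy simulation (hypothetical numbers, not drawn from any study): randomly miscoding some outcomes in both directions shrinks a true positive effect toward zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Suppose the program truly cuts the re-arrest rate from 40% to 30%.
in_program = rng.random(n) < 0.5
true_rearrest = rng.random(n) < np.where(in_program, 0.30, 0.40)

# Now flip 15% of the outcome codes at random in each direction
# (re-arrests missed, and non-re-arrests wrongly coded as re-arrests).
flip = rng.random(n) < 0.15
recorded = true_rearrest ^ flip

true_gap = true_rearrest[~in_program].mean() - true_rearrest[in_program].mean()
measured_gap = recorded[~in_program].mean() - recorded[in_program].mean()
print(f"true effect:     {true_gap:.3f}")
print(f"measured effect: {measured_gap:.3f}")  # attenuated toward zero
```

With 15% misclassification in each direction, the measured gap shrinks to about 70% of the true gap, so a genuinely effective program can look weaker than it is.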

Perhaps most importantly, we now have other studies that are methodologically more valid. We thus don’t need to spend too much time interpreting the results of the less valid studies. Tomorrow’s post will discuss some of these potentially more valid studies.