This is the third post in a series on the effectiveness of faith-based prison programs, based on my recent Alabama Law Review article, Do Faith-Based Prisons Work? (Short answer: no.) Monday’s post introduced the issue, and Tuesday’s post surveyed some of the least valid studies.

Why go through this? Because it nicely illustrates the basic problem with this sort of evaluation: selection effects, in particular self-selection effects. Because faith-based prison programs are voluntary (and pretty much have to be, given Establishment Clause doctrine), the people who participate may be fundamentally different from the people who don’t. For instance, they may have greater motivation to change, which might already make them a better population. So — unless you can find some way to compare identically motivated people — comparing participants with non-participants will generally give you positive results that are completely spurious. And this is true even if you control for a bunch of variables, as long as there remains a component of motivation that’s unobservable.

Today, I discuss another way of evaluating faith-based prisons: matching on propensity scores. Not everyone has heard of propensity scores, so I’ll do my best to explain them in terms a layman can understand. No guarantees, though: empirics are hard! My conclusion is that this method also isn’t sufficient to solve the self-selection problem.

As before, I also illustrate using the private (or Catholic) school evaluation literature, which has similar methodological problems: since attendance at private schools is voluntary, there might be something unobservable about participating families that already makes them more likely to succeed academically. So naive comparisons won’t cut it — and, I argue, neither does propensity score matching.

In the second part of the post, I talk about various methods that are more valid, but that haven’t been attempted (as far as I know) in the faith-based prison evaluation literature. One is the instrumental variables method; another is the exogenous policy shocks method. Because these methods haven’t been used for prisons, I illustrate them with the education literature.

*     *     *

Matching on the Propensity Score

In this Part, I discuss a technically more sophisticated way of dealing with selection problems: propensity score matching.

In propensity score matching, the researchers first identify the observable variables that best predict whether someone will participate in the program. This first-stage estimation generates a “propensity score” for each inmate: essentially an estimated probability of participating in the program. One inmate may have participated and another may not have, but both may have propensity scores of, say, 70%, meaning they are estimated to have been equally likely, ex ante, to choose to participate.

The matching process then matches each participant to a non-participant with a similar propensity score; a 70%-propensity participating inmate is matched with a 70%-propensity non-participating inmate, even though these inmates may differ on various individual characteristics.

Practitioners of propensity score matching point to certain advantages of the method over trying to match on observable variables directly. Given a participant with particular observable characteristics, it is often hard or impossible to find a non-participant with identical, or nearly identical, values of those same variables; by contrast, it is easier to match according to a single number.

But propensity score matching can’t overcome the problems of selection bias in the case of faith-based prisons. To see this, suppose that there were so many non-participating prisoners that exact matching on observables was always possible; every participating inmate would be matched with a non-participant who looked exactly identical. Because these two inmates would have identical observable characteristics, they would also have identical propensity scores. Matching on propensity scores would then produce exactly the same control group as the previous set of studies, which matched on observables directly.

Thus, if the direct matching studies weren’t credible, the propensity score matching studies aren’t credible either. Using the propensity score may improve the efficacy of matching, but it doesn’t alleviate the self-selection problem.

More technically, the problem is that propensity score methods give the correct result only if nonobservables play no role in the selection mechanism, or, more precisely, only if the unobserved determinants of participation play no role in ultimate success (that is, low recidivism). This assumption is quite false in the case of faith-based prison programs, where motivation to change, and possibly religiosity itself, both determine participation in the program and play a large role in whether an inmate reoffends. James Heckman and Richard Robb argue that “[t]he propensity score methodology solves a very special problem . . . that is of limited interest to social science data analysts.” Whether or not Heckman and Robb are right about propensity score studies in general, faith-based prison evaluation certainly seems like one area where the method isn’t credible.
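The point can be illustrated with a toy simulation (a sketch with invented parameters, not any study’s actual code). Participation below depends on two observable covariates plus unobserved motivation; nearest-neighbor matching on the estimated propensity score balances the observables nicely, but the motivation gap between participants and their matched controls survives intact.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Hypothetical observable covariates.
age = rng.normal(35, 10, n)
priors = rng.poisson(2, n)

# Participation depends on observables AND on unobserved motivation.
motivation = rng.normal(0, 1, n)
logit = -2 + 0.03 * age - 0.2 * priors + 1.0 * motivation
participates = rng.random(n) < 1 / (1 + np.exp(-logit))

# First stage: estimate propensity scores from observables only.
X = np.column_stack([age, priors])
pscore = LogisticRegression().fit(X, participates).predict_proba(X)[:, 1]

# Match each participant to the non-participant with the nearest score.
treated = np.flatnonzero(participates)
controls = np.flatnonzero(~participates)
matches = controls[
    np.abs(pscore[controls][None, :] - pscore[treated][:, None]).argmin(axis=1)
]

# Matched pairs have nearly identical propensity scores...
print(np.abs(pscore[treated] - pscore[matches]).mean())  # small
# ...but motivation is NOT balanced: participants remain more
# motivated than their matched controls, so any outcome gap
# between the groups is contaminated by motivation.
print(motivation[treated].mean() - motivation[matches].mean())  # positive
```

Nothing about the matching step can fix this, because the score is computed from observables alone; the simulation just makes the abstract argument concrete.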

1. O’Connor et al.’s New York Study

Tom O’Connor and his coauthors analyzed the effect on prison infractions and recidivism of participation in Prison Fellowship programs in New York prisons. The participating group of 225 inmates was matched with a control group based on race and a propensity score calculated using six variables—“age, religion, county of residence, military discharge, minimum sentence and initial security classification.”

The study found no significant difference between participants and the control group in prison infractions, number of re-arrests, or time to re-arrest. Among participants, 37% had infractions: 28% had security infractions, 16% had nonviolent infractions, and 15% had violent infractions. In the control group, the percentages were 32%, 23%, 18%, and 11%, respectively. None of these differences were significant. Nor was there any significant difference in the frequency of re-arrest (36% for participants versus 34% for non-participants), though a difference emerged when arrests were broken down by type of charge. Participants were “more likely to be re-arrested for a violent offense” (28% versus 16%), but “less likely to be re-arrested for a drug offense” (21% versus 44%). There were also significant differences when re-arrests were broken down by region—for whatever reason, a re-arrest of a participant was more likely to occur in upstate New York (and less likely to occur in New York City or suburban New York) than the re-arrest of a non-participant.

The authors then divided the group into high-participating and low-participating groups. There was still no significant difference between high and low participants in infraction or re-arrest rates. The authors then computed a score from 0 to 3 for each inmate, based on the “Level of Supervision Inventory” that measured their estimated risk of being re-arrested, and then classified inmates by PF participation level (none, low, or high) and risk score (0, 1, 2, or 3). When they did this, they found that among high-risk PF inmates—that is, inmates with a risk level of 3—high-participating inmates were significantly less likely to be re-arrested than low-participating inmates.

However, as I have explained above, we shouldn’t read anything into this last set of results. Any analysis that divides inmates by levels of participation merely reintroduces self-selection bias. One can’t compare the control group against a self-selected sample of the treatment group, nor can one compare self-selected parts of the treatment group (high Bible study participants) against other self-selected parts (low Bible study participants). Even if this told us the effect of high participation (which it probably doesn’t), the proper question for a policymaker deciding whether to introduce such a program is how well it works for everyone, including those who choose not to participate much.

2. Johnson et al.’s New York Study

Byron Johnson and his coauthors reanalyzed this data, using only 201 inmates instead of the original 225. They found substantially the same results. There was no significant difference between participating and non-participating inmates in rates of infractions (36% versus 31%), serious infractions (8% versus 9%), or re-arrest (37% versus 36%).

When inmates were broken down by level of participation (low, medium, or high), there continued to be no significant difference between Prison Fellowship (PF) and non-PF inmates, except that high-participating PF inmates were re-arrested at lower rates than their non-PF counterparts (14% versus 41%). High-participating PF inmates were also significantly less likely to be arrested than low- or medium-participating PF inmates. The authors also further broke down inmates by risk level and found that high participation continued to be associated with a lower re-arrest rate.

But, as discussed above, we shouldn’t divide the sample based on participation level, since this introduces a new source of self-selection bias.

When Johnson did a follow-up evaluation on these same inmates seven years later, he again found no significant difference in median time to re-arrest or in reincarceration rates between participating and non-participating inmates. When the sample was divided into high- and low-participating groups, high-participating inmates had a lower two-year probability of re-arrest than low-participating ones, but this effect disappeared after three years.

3. Camp et al.’s Life Connections Program Study

Scott Camp and his coauthors analyzed the effect on prison misconduct of participation in the Life Connections Program. They estimated the probability of participation (i.e., propensity score) using a number of models; the fit of these models was reasonably good. Variables used included a “scale of motivation for change,” frequency of spiritual experiences and religious observance, religious affiliation, “feelings of self-worth,” custody risk, previous incarceration, age, ethnicity, “race, sex, education, marital status, and months of current incarceration” so far.

There was generally no significant association between participation and misconduct in general, and no association between participation and less serious misconduct. However, there was a significant association between participation and serious misconduct: in some of the models, “slightly over 5 percent of the inmates in the LCP had an instance of serious misconduct, where for the comparison group, the number was closer to 11 percent.” Other models on serious misconduct produced differences that were smaller, but still significant.

This article has a significant advantage that the others in this Subpart don’t have. I’ve argued that the problem with comparative studies, even ones based on propensity scores, is that they don’t get at the unobserved motivation to change. As I’ve noted above, though, Camp et al. explicitly include “a scale of motivation for change” in their first-stage propensity model. If this scale accurately measures motivation for change, then it can potentially solve the selection problem. Unfortunately, this scale, developed by Prochaska and DiClemente, is derived from inmates’ own self-reported views, so it should be taken with a grain of salt.

4. Education Studies

As with the previous set of studies, this ground has already been trodden by education researchers, with similar methodological vulnerabilities. Unobserved motivation is as problematic with private or Catholic schools as with faith-based prisons—a student’s (or his parents’) motivation is correlated both with a decision to choose a different school and with success on outcomes like test scores.

Thomas Hoffer and his coauthors (including James Coleman) predicted the probability that a student would choose a Catholic school using the background measures from their base-year analysis and a measure of sophomore achievement. Then they “stratified the sample into quintiles of the propensity score[s] and estimated Catholic-school effects within each of these homogeneous groups.” They found that controlling for selection using this method didn’t change the results much relative to the results earlier in their paper, which they had estimated without propensity scores.

Stephen L. Morgan similarly estimated propensity scores and stratified the sample into quintiles. He found that “there is considerable variation in estimates of the average causal effect for Catholic school students with different propensities for attending Catholic schools”; “the Catholic students who are least likely to be enrolled in Catholic schools . . . are the most likely to benefit from having attended a Catholic school.” Overall, he found that students in Catholic schools benefited from attending those schools, and—unlike Hoffer et al.—the effect he estimated was larger than the standard regressions that didn’t control for selection into Catholic schools.

In any event, because these studies don’t account for selection on unobservables, it isn’t worth dwelling on them at length. Since there are more valid studies that are able to control for selection on unobservables, let’s move on to those.

*     *     *

Potentially Valid Studies

The only credible studies of faith-based prisons done so far have been those where the comparison group of inmates was made up of those who volunteered for the faith-based program but were rejected. However, before describing those studies, I discuss a few empirical strategies that have been used for private schools but, for whatever reason, haven’t been attempted for faith-based prisons: the instrumental variables method and identification by exogenous policy shocks.

The Roads Not Taken

The empirical literature on education is extremely large, and there has been a lot of debate on appropriate empirical methods. Here, I focus on two widely used approaches that can deal with selection: the instrumental variables approach and the exogenous policy shock approach.

1. Instrumental Variables

Standard regression models (the “ordinary least squares” method) take as given that we won’t be able to explain all of the variation in the variable of interest, whether that variable is ex-prisoners’ recidivism or students’ test scores. There will always be some error, as is recognized by the ε term in the standard notation, y = Xβ + ε. The models do, however, demand that the average value of the error term, ε, not depend on the explanatory variables in X. It’s useful to think of the error as embodying not just whatever inherent randomness may exist in the world, but also every omitted variable. The requirement that ε, on average, not depend on X can thus be interpreted as a rule that one can harmlessly omit variables (either by choice or because they’re unobservable) as long as the omitted variables are uncorrelated with the included ones.

This is precisely the problem with selection bias: the inmate’s (or the student’s parents’) motivation, which is an omitted variable and therefore part of the error term, is correlated with the main explanatory variable—whether the inmate signs up for a faith-based prison program or the student attends a private or Catholic school.
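A quick simulation makes the bias concrete (all numbers here are invented for illustration). The true effect of participation on the outcome is zero by construction, but because unobserved motivation drives both participation and the outcome, OLS attributes motivation’s effect to the program:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Unobserved motivation drives both participation and the outcome.
motivation = rng.normal(0, 1, n)
x = (motivation + rng.normal(0, 1, n) > 0).astype(float)  # participation (0 or 1)
y = 0.0 * x + 1.0 * motivation + rng.normal(0, 1, n)      # true effect of x is ZERO

# OLS slope of y on x (with an intercept): beta_hat = cov(x, y) / var(x).
beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
print(beta_hat)  # well above zero: a spurious "effect" from omitted motivation
```

If motivation were uncorrelated with x, the same calculation would hover around the true value of zero; the bias comes entirely from the correlation.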

If we ran an ordinary least squares regression on the equation above, we would get biased estimates of β. But there are ways around this. Suppose we could find some other variable, Z, that predicted X but was uncorrelated with unobservable motivation. For instance, suppose Catholic religion (Z) predicted whether someone attended Catholic school (X) (this seems true, since Catholics are more likely to attend Catholic school) but was uncorrelated with the unobservable determinants of scholastic success (this seems possible, since why would Catholics do better in school?). We would call Z an instrument for X.

We would then use a two-stage process, called the instrumental variables (or IV) method. Initially, we would use Z to obtain a predicted value of X—call it X’. Instead of having a 0 or 1 value of whether someone attended a faith-based prison program or Catholic school, we’d have their predicted value based on Z; this would typically be a number between 0 and 1, and we could think of it as their probability of attending the program.

Once this first stage was done and we had our predicted X’, we would replace X with X’ in the regression, and estimate the regression y = X’β + ε. We would then use the resulting estimate of β. (This method thus has the flavor of matching based on propensity scores, as discussed above, but it has the advantage of being able to handle selection on unobservables.)

Mathematically, it turns out that, unlike the naïve estimation, this two-stage IV process gives us a consistent estimate of β. The advantage of using X’ instead of X is that, because X’ is predicted from Z alone (which is uncorrelated with the error term), it isn’t “contaminated” by whatever is in the error term, like unobservable motivation. In essence, the two-stage process has “purged” X of the pernicious effects of unobserved motivation.
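Continuing with the same kind of toy simulation (all parameters invented), here is a sketch of the two-stage logic. With a single binary instrument, 2SLS collapses to the simple Wald ratio cov(Z, y) / cov(Z, X). The instrument shifts participation but is independent of motivation, so the IV estimate lands near the true effect while naïve OLS overshoots:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# z: a hypothetical binary instrument (think: Catholic religion),
# generated independently of motivation.
z = (rng.random(n) < 0.3).astype(float)
motivation = rng.normal(0, 1, n)

# x (participation) depends on both the instrument and unobserved motivation.
x = (0.8 * z + motivation + rng.normal(0, 1, n) > 0).astype(float)
beta_true = 0.5
y = beta_true * x + motivation + rng.normal(0, 1, n)

# Naive OLS is biased upward by the omitted motivation.
ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV/2SLS: the first stage predicts x from z; the second stage regresses
# y on the prediction. With one binary instrument this reduces to the
# Wald estimator below.
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(ols, iv)  # ols well above 0.5; iv close to the true 0.5
```

The IV estimate is noisier than OLS (it uses only the variation in x induced by z), which is part of the price of purging the selection bias.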

Of course, whether the IV method works depends on whether we can find a true instrument—something that really predicts X and is really uncorrelated with ε. Good instruments are hard to find. We can test whether Z predicts X—just try doing it and see how well it works—but we can’t directly test whether Z is correlated with ε, since the true error term is unknown; this is unfortunate, since even moderate correlations can introduce substantial bias into the IV estimates.

These potential problems haven’t stopped education researchers from using IV methods.

James Coleman and his coauthors used two strategies. First, they used religion together with region (Northeast or other) as instruments for Catholic school attendance; then they used religion together with income and educational expectations in the eighth grade. They rejected both of these models because the resulting Catholic-school effect was implausibly large. But note that even if religion is a valid instrument, it seems that income and previous educational expectations should be correlated with the unobservable determinants of scholastic success, which makes them invalid instruments.

Other authors, using different specifications, have found conflicting results. Jay Noell, in the reanalysis of Coleman’s work discussed above, also used Catholic religion as an instrument for Catholic school attendance; this made the Catholic-school effect insignificant. Richard Murnane and his coauthors, on the other hand, used Catholic religion as an instrument and determined that Catholic school attendance had a significant effect on Hispanic students’, and possibly also on black students’, achievement.

Using Catholic religion as an instrument seems to have fallen out of fashion, after various researchers suggested that being Catholic is unfortunately correlated with the unobserved determinants of scholastic success. The same goes for a related variable, frequency of church attendance.

A better instrument might be a variable unrelated to one’s own characteristics—perhaps the Catholic share of the population of one’s county, which could affect Catholic school attendance just because Catholic-heavy counties have more Catholic schools and possibly lower tuitions because they’re more heavily subsidized by their local congregations.

Thus, William Evans and Robert Schwab used, among other variables, Catholic county population as an instrument. This strategy didn’t change the high-school graduation results much compared to a naïve specification without instruments, though the college entrance results were more sensitive to the choice of specification.

Jeffrey Grogger and Derek Neal used the county’s Catholic school density and the county’s percentage of Catholic population. They found Catholic-school effects on high-school graduation for urban minorities that were even larger than in the models without selection. They also found significant effects for urban whites, though no effects for suburban students (whether white or minority). There were no significant effects of Catholic school on college entrance.

Derek Neal used these same variables—county Catholic school density and county Catholic concentration—but not at the same time. He estimated two different models since the validity of the instruments seemed to differ as between urban minorities and urban whites. The analysis of minorities used only Catholic school density as an instrument, while the analysis of whites used only local Catholic population density. A positive effect of Catholic school attendance on high-school graduation rates remained after this correction for selection bias and, in fact, even increased.

Other studies use instruments unrelated to Catholicism. William Sander and Anthony Krautmann used, among other variables, “urban” interacted with region and concluded that Catholic schooling has a highly significant negative effect on the probability that a sophomore drops out before his senior year, but no effect on educational attainment beyond high school.

Dan Goldhaber also used a number of variables, including controls for the cost and availability of private schools, dummy variables for region and urbanicity, and percent of white students at the students’ school. He found no positive sectoral effect favoring private schools.

David Figlio and Joe Stone predicted sector choice using, among other factors, whether the state had “duty to bargain” or “right-to-work” laws, as well as median county income. They found that private schools, whether religious or nonreligious, had no relation to math test scores, but were significantly related to two years of college enrollment, as well as enrollment in a selective college.

All these models use different specifications, have different choices of instruments, and yield different results. Some find an effect of private or Catholic schools; some don’t. The moral, though, is that finding a good instrument is hard. Many instrumental-variables studies have been sloppy about explaining why the instrument Z is correlated with X and why it’s uncorrelated with ε. Pretty much any individual attribute, whether Catholic religion, or income, or race, probably has some correlation with the unobserved determinants of success. Aggregate variables, like the Catholic population density in the child’s county, may work better, but of course aggregate variables may also affect achievement. Moreover, the aggregate approach only works as an estimation strategy if we observe children from a large number of different aggregates: if all the children in the study come from the same county, we won’t be able to use the local Catholic population density as an instrument, since it will be the same for each child.

This is a problem for faith-based prison studies as well. So far, almost all faith-based prison studies have analyzed the results of a single program at one prison; only a small number deal with more than one prison. Perhaps an IV approach wouldn’t have been very useful in most of these cases, but it would be worth exploring when a data set covers inmates from several prisons.

2. Exogenous Policy Shocks

Other studies have identified the effect of educational policies using exogenous shocks. Some of these are natural shocks; some are policy shocks when a policy is first introduced; some are policy shocks when an already existing policy is applied in a particular context for a random reason.

Here are some examples, unrelated to the public versus private school debate:

  • Caroline Hoxby identified the effect of class size on student achievement using two strategies. First, she used natural randomness in the population, which makes certain classes larger or smaller from year to year. Second, she used “the fact that class size jumps abruptly when a class has to be added to or subtracted from a grade because enrollment has triggered a maximum or minimum class size rule.” (This is the “regression discontinuity” approach.) Both strategies showed little or no effect of class size on achievement.
  • Joshua Angrist and Victor Lavy used a similar regression discontinuity approach to study the effect of class size on achievement in Israel. Unlike Hoxby, they found a negative effect of class size on achievement. (Of course, in all studies of this type, we want to guard against parents’ ability to game the system by choosing schools with enrollments just above the cutoff, which would bias the results.)
  • Martin West and Paul Peterson examined the effect on a Florida public school of receiving an F grade on the state’s A+ Accountability Plan. Students at schools that received an F twice would get a voucher for private school; F schools were also assigned a team to write an intervention plan for the school. In 2002, Florida changed its evaluation system so that most schools received a different grade than the previous year. To isolate the effect of an F—separate from the effect of being subject to a voucher threat—the authors focused only on the 24 schools that hadn’t previously gotten an F, that wouldn’t have gotten an F under the old system, but that did get an F under the new system. They compared these schools to all D schools whose scores were close to those of the 24 F schools. They found that getting an F had a significant positive effect on student achievement. The same was true for D schools, as compared to C schools.
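The maximum-class-size idea in the first two bullets can be sketched with simulated data (a hypothetical 40-pupil cap; all coefficients invented). Schools just above the cutoff must split into two classes, so class size drops discontinuously while enrollment, and everything correlated with it, changes smoothly; comparing the two sides of the cutoff recovers the simulated class-size effect:

```python
import numpy as np

rng = np.random.default_rng(3)
n_schools = 5000

# Enrollment near the cutoff of a hypothetical 40-pupil maximum class size rule.
enrollment = rng.integers(30, 52, n_schools)
n_classes = np.ceil(enrollment / 40)
class_size = enrollment / n_classes  # jumps from ~40 down to ~20 at 41 pupils

# Achievement depends smoothly on enrollment and (in this simulation)
# negatively on class size; -0.3 is the effect we try to recover.
achievement = 60 + 0.1 * enrollment - 0.3 * class_size + rng.normal(0, 5, n_schools)

# Compare schools just below and just above the cutoff: enrollment is nearly
# identical, but class size drops discontinuously, so the achievement gap
# scaled by the class-size drop estimates the class-size effect.
below = (enrollment >= 36) & (enrollment <= 40)  # one class of 36-40 pupils
above = (enrollment >= 41) & (enrollment <= 45)  # two classes of ~20-23 pupils
gap = achievement[above].mean() - achievement[below].mean()
delta_size = class_size[above].mean() - class_size[below].mean()
estimate = gap / delta_size
print(estimate)  # close to the simulated -0.3
```

The small remaining bias comes from the smooth enrollment effect, which is why real regression discontinuity studies control for a smooth function of the running variable rather than comparing raw means.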

There are many more examples. Exogenous policy shocks are another way of dealing with self-selection: If we compare an entire prison before the introduction of a faith-based program with the same prison after the introduction of the program, we don’t have to deal with self-selection issues as long as people don’t choose which prison they go to, and as long as the assignment mechanism didn’t change once the program was introduced.

Or we could compare a prison with a faith-based program to a prison without one, though one would want to be sure that the two prisons are really comparable. Again, the comparison would have to be between entire prisons, since limiting the sample at one prison to participants would reintroduce self-selection issues.

Or one could merge the two approaches and observe how the difference between two prisons changed when a faith-based program was introduced at one of them. This would essentially be a differences-in-differences approach.
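The arithmetic of a differences-in-differences comparison is simple enough to write out (the recidivism rates here are invented for illustration): subtracting each prison’s own before-period rate removes fixed differences between the prisons, and subtracting the untreated prison’s change removes the trend common to both.

```python
# Hypothetical fractions of released inmates re-arrested, for two comparable
# prisons; prison A introduces a faith-based program between the two periods,
# prison B never does.
rates = {"A": {"before": 0.45, "after": 0.38},
         "B": {"before": 0.50, "after": 0.48}}

change_a = rates["A"]["after"] - rates["A"]["before"]  # program effect + shared trend
change_b = rates["B"]["after"] - rates["B"]["before"]  # shared trend only
did = change_a - change_b  # differences-in-differences estimate of the program effect
print(round(did, 2))  # -0.05
```

The key assumption, of course, is that the two prisons would have followed parallel trends absent the program.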

So the exogenous policy shock approach seems promising for faith-based prisons. This is another area where prison researchers could learn from education researchers.

In tomorrow’s post, I’ll discuss a valid method that has been used in faith-based prison research: the rejected volunteers method.