School reformers had something of a banner year, moving ahead with key initiatives such as using standardized test scores to evaluate teachers, expanding charter schools and establishing voucher programs that permit public funds to be used to pay religious school tuition. But is any of this grounded in research? Here’s a look at the year in ed research from Matthew Di Carlo, senior fellow at the non-profit Albert Shanker Institute, located in Washington, D.C. This post originally appeared on the institute’s blog.

By Matthew Di Carlo

This year was a busy one for market-based education reform. The rapid proliferation of charter schools continued, while states and districts went about the hard work of designing and implementing new teacher evaluations that incorporate student testing data, and, in many cases, performance pay programs to go along with them.

As in previous years (see our 2010 and 2011 reviews), much of the research on these three “core areas” – merit pay, charter schools, and the use of value-added and other growth models in teacher evaluations – appeared rather responsive to the direction of policy making, but could not always keep up with its breakneck pace.*

Some lag time is inevitable, not only because good research takes time, but also because there’s a degree to which you have to try things before you can see how they work. Nevertheless, what we don’t know about these policies far exceeds what we know, and, given the sheer scope and rapid pace of reforms over the past few years, one cannot help but get the occasional “flying blind” feeling. Moreover, as is often the case, the only unsupportable position is certainty.

In the area of merit pay, there was one large-scale program evaluation released in 2012: Mathematica’s final report on the four-year evaluation of the Chicago Teacher Advancement Program (TAP). TAP is a multifaceted program that provides bonuses, career ladders, training, and other interventions for teachers in order to improve their performance and retention. The findings were somewhat mixed. Consistent with prior research (see here and here), the Mathematica team did not find any discernible effects on testing outcomes over this (relatively short) period, but there did appear to be some impact on school-level teacher retention.**

However, most of the action in 2012 was a bunch of papers that might be interpreted as attempts to understand and/or address teacher incentive programs’ failure to produce results in the U.S.

One example, which received a fair amount of attention, was this evaluation, by Roland Fryer, Steven Levitt, and colleagues. In this program, teachers were paid a bonus at the beginning of the year, with some forced to return a portion of it based on their students’ progress on tests. This kind of incentive, which exploits “loss aversion,” had a large impact among teachers in the treatment group. Although this finding is genuinely interesting from a research perspective, and definitely merits further attention, its policy implications are still less than apparent. To their credit, merit pay proponents seemed to recognize this.

A second working paper assessed incentive strength in a group-based program. Based on the idea that rewarding larger groups of teachers decreases the incentive for each individual teacher (essentially, a free-rider problem), the analysis found that impacts were slightly larger (in math, reading and social studies, but not science) for teachers who were “responsible” for larger groups of students. This, the authors speculate, may be one reason why schoolwide bonus programs (e.g., New York’s) haven’t produced results.

Finally, a conference paper using data from a schoolwide bonus program in North Carolina found that schools that just missed the cut in one year tended to exhibit large gains the following year, relative to schools that came in just above the threshold. The researchers hypothesize that this suggests teachers and administrators may respond to incentives when they receive a clear signal that rewards are possible, but that a “period of learning” may be required before these programs exhibit impacts.

On the whole, these studies (also see this paper) would seem to suggest, unsurprisingly, that teachers and administrators do respond to incentives, but not necessarily those embedded in the “traditional” models (e.g., individual bonuses for scores at the end of the year). However, most of the new merit pay policies rely heavily on the “traditional” conceptualization (from which supporters inexplicably seem to be expecting short-term testing gains), and, again, the actual policy applications of these recent findings, if any, remain unclear.

Thus, predictably, merit pay ends the year in roughly the same situation as it started. Proponents contend that the primary purpose of alternative compensation systems is less to compel effort than to attract “better candidates” to the profession, and keep them around. From this perspective, it is unlikely that we will see much in the way of strong evidence – for or against – for quite some time, and short-term testing gains may not be the most appropriate outcome by which to assess these policies (see this simulation from this year, as well as our discussion of it). In other words, merit pay remains, to no small extent, a leap of faith.

Moving on to the charter school area, 2012 was another year of extremely rapid growth in this sector. It may also represent a turning point in the direction of research on these schools.***

In contrast to previous years, very few of the analyses released employed the typical charter versus district “horse race” approach. The only notable exceptions were two CREDO reports. Their analysis of New Jersey’s small charter school sector found that the charters included had a significant positive impact vis-à-vis comparable regular public schools, though it appeared largely confined to a small group of Newark schools. Similarly, the CREDO team’s evaluation of Indiana charters found statistically significant, though rather modest, positive effects statewide, mostly concentrated in Indianapolis.

Although such “horse race” studies were scarce this year, a Mathematica report addressed the methodological question of whether results from experimental evaluations of charter school impacts “match up” with those from non-experimental analyses. This is important because most charter research relies on non-experimental methods (since experiments generally require lotteries). This report (following others) put forth the encouraging result that the experimental and non-experimental estimates were not particularly different, at least not enough to substantially alter the conclusions. We discussed this paper here.

There was also some progress in building the growing and arguably most important body of evidence – analyses that attempt to move beyond the “charter versus district” debate, and begin to identify the actual differences between more and less successful schools, of whatever type.

One contentious variation on this question is whether charter schools “cream” higher-performing students, and/or “push out” lower-performing students, in order to boost their results. Yet another Mathematica supplement to their 2010 report examining around 20 KIPP middle schools was released, addressing criticisms that KIPP admits students with comparatively high achievement levels, and that the students who leave are lower-performing than those who stay. This report found little evidence to support either claim (also take a look at our post on attrition and charters).

A related analysis, this one presented in a conference paper, found that low-performing students in a large anonymous district did not exit charters at a discernibly higher rate than their counterparts in regular public schools. On the flip side of the entry/exit equation, this working paper found that students who won charter school lotteries (but had not yet attended the charter) saw immediate “benefits” in the form of reduced truancy rates, an interesting demonstration of the importance of student motivation.

A couple of papers also looked at more concrete policies employed by charters. Most notably, a joint report from Mathematica and the Center for Reinventing Public Education focused on practices among charter management organizations (CMOs). A previous report by the same team, released in 2010, found that the charter schools run by these CMOs were, on the whole, comparable in terms of test-based performance to their regular public school counterparts, even though the sample consisted of more established organizations (which one might expect to do well).

Among the many findings presented in this useful follow-up were that the higher-performing CMOs included in the analysis were more likely to provide teacher coaching and performance pay, and that they offered, on average, more instructional time (also take a look at this initial set of findings from CREDO about the performance trajectory of new charter schools).

So, 2012 saw no major “bombshells” in the charter school literature – and that may be a good thing, since focus may be shifting to the important, albeit unsexy, task of drilling down into the mechanisms underlying the overall results. In a time of unprecedented charter proliferation, explaining the consistently inconsistent performance of these schools is critical, not only for guiding the authorization of new charters, but also, more importantly, for improving all schools, regardless of their governance structures.

In the third and final area of market-based reform – the use of value-added and other growth model estimates in teacher evaluations – 2012 might be remembered as the year in which a second batch of teacher-level value-added scores were published in a major newspaper. These consisted of the “teacher data reports” from New York City. As in Los Angeles in 2010, the publication provoked opposition among value-added supporters and opponents alike. We discussed and analyzed the data in this post (also see here).

But, in terms of actual original research, we’ll begin with the paper that received more attention than almost any other in recent years: the analysis of the long-term impacts of teachers, by economists Raj Chetty, John Friedman, and Jonah Rockoff (the paper was actually released very late in 2011). The enormous reaction to this working paper focused mostly on the finding that increases in estimated teacher effectiveness are associated with very small improvements in a wide variety of future student outcomes, including higher earnings, higher college attendance, and lower rates of teenage pregnancy.

It is fair to say that these findings, in addition to being genuinely interesting and important from a research perspective, support the long-standing contention that value-added estimates do transmit some meaningful signal about teacher performance, and might play a role in teacher evaluations (though not necessarily the role that they’re being called upon to play).

Another part of the paper, which got comparatively little attention, was arguably just as significant from a policy perspective – the results addressing the question of whether the non-random assignment of students to classrooms biases value-added estimates. That is, whether some teachers are assigned students based on characteristics that are associated with testing performance, but are not captured by the models (see Jesse Rothstein’s highly influential articles on this – here and here).

Chetty, Friedman, and Rockoff devise a clever test for this bias, and find that the problem does not appear to be critical (also check out this earlier response to Rothstein). In addition, they provide a very easy way for states to test their own estimates for this bias, using data that are widely available. It is unclear whether any states have chosen to do so, however. You can read our discussion of the Chetty et al. paper here.

The issue of non-random assignment of students to classrooms, and its potential influence on value-added scores, was also the focus of this 2012 CALDER paper, which questioned the validity of the “Rothstein test,” and found that it might identify non-random sorting even when none exists (also see this conference paper, which concluded that principals do assign students in non-random ways, and that the extent of sorting varies within and between schools). On the whole, it is likely that non-random sorting does bias teacher value-added scores, but the magnitude of this bias – and how it compares with that of alternative performance measures – remains a somewhat open question.

Other analyses in the value-added area also continued to move toward the kind of concrete, policy-relevant research that might have guided the design of new evaluation systems. For instance, one big issue facing states is the choice of a model. Although the public discourse tends to portray value-added models as a kind of monolith, there are actually a bunch of different specifications, many of which are not actually value-added models per se (value-added models, which themselves come in different forms, are generally considered to be a specific type of growth model; other types, such as student growth percentile models, are also being used by states and districts). Thus, analyses of how results differ between models are quite important.

A working paper from the Center for Education Data and Research (CEDR) compared the results of different models using North Carolina data, and found relatively high correlations between most of the models tested, including the more common types being used in actual evaluation systems. There was, however, a much lower correlation between these and more complex models (in particular, those employing fixed effects), and, in all cases, differences in the compositions of classrooms influenced the results of these comparisons (also see this similar analysis from this year).

In other “nuts and bolts” papers, Mathematica researchers laid down some statistical techniques for handling the important issue of co-teaching, while this NBER working paper presented a practical method for handling test measurement error among students who take tests in three consecutive grades. On an even more basic level, a team from the University of Wisconsin simulated the potentially serious problems that might arise from simple “clerical errors” in the datasets used to calculate value-added.

It’s difficult to assess the degree to which states and districts are addressing or considering issues such as error and model choice, but, in at least some cases, it’s not clear that they’re getting much attention at all.

Another area that is under-researched (and related to the non-random assignment issue discussed above) is value-added among high school teachers. One potentially important NBER working paper released this year found that high school teachers’ estimates may have serious problems, particularly those stemming from tracking. The author, C. Kirabo Jackson, offers two possible interpretations: either high school teachers are not as influential (in terms of test-based impacts) as elementary school teachers; or value-added is a poor tool for measuring the effectiveness of high school teachers. (Also see this extremely interesting 2012 working paper, by the same author, comparing teachers’ impact on cognitive and non-cognitive outcomes.)

Finally, a number of new analyses tackled the important, much-discussed issue of the stability of value-added estimates. First, using a dataset spanning a full ten years, a CALDER working paper found considerable volatility, but also some persistence, in teachers’ value-added scores, even over that very long time period. A second analysis concluded that the precision and year-to-year stability of teachers’ value-added scores varied considerably by the types of students they had in their classes.

Third, this CEDR paper looked at stability, not across years, but rather between subjects. This is actually a somewhat under-researched topic, despite its obvious implications for using these estimates in accountability systems. In short, the authors found that the between-subject correlations are similar to those between years – modest.

Overall, then, there was a great deal of strong research on value-added and other growth models this year, much of which can inform (or, at least, could have informed) the design of teacher evaluations. In the end, though, the real test will be whether the new systems improve teacher and student outcomes.

There is one more 2012 publication worth mentioning in this area, which consists of a series of five “background papers” on value-added, published by the Carnegie Knowledge Network. Each deals with a different aspect of these estimates, and all are written by prominent researchers, who present the state of the research in a manner that is quite accessible to non-technical readers. They are an excellent resource.*****


So, 2012 was a year in which the research on charter schools, merit pay, and value-added continued to provide policy-relevant findings, even if, in some cases, the decisions this evidence might have guided had already been made. The next few years will be critical, as researchers monitor and evaluate the sweeping policy changes that have taken place, particularly new evaluation systems and the financial incentives that accompany them. It remains to be seen whether states and districts will be willing or able to adjust course accordingly.


* Needless to say, these three areas are not the only types of policies that might fall under a “market-based” umbrella (they are, however, arguably the “core” components, at least in terms of how often they’re discussed by advocates and proposed by policy makers). In addition, the papers discussed here do not represent a comprehensive list of all the research in these areas during 2012. It is a selection of high-quality, mostly quantitative analyses, all of which were actually released during this year (i.e., this review doesn’t include papers released in prior years, and published in 2012). This means that many of the papers above have not yet been subject to peer review, and should be interpreted with that in mind.

** There was also some initial evidence on the impacts of Denver’s ProComp program, which, like TAP, provides different types of incentives and opportunities for teachers. A non-experimental evaluation from the Center for Education Data and Research (CEDR) found some tentative indication that the program may improve test-based outcomes, though it was not possible to rule out the possibility that this was due to other factors (also see our post, by researcher Ellie Fulbeck, who took a look at ProComp’s effects on retention).

*** Given the volume of studies, this review does not include other types of school choice policies, such as vouchers. Those interested might check out this Brookings analysis of the effect of New York City’s voucher program on graduation outcomes, as well as the summary of final reports on Milwaukee’s school choice program.

**** The scarce evidence, thus far, suggests that the handful of charter models that get fairly consistent positive results are those utilizing a somewhat “blunt force” approach – more money, more time, more staff, and more rigid disciplinary policies.

***** Though not discussed in this review, it’s worth noting that there was a great deal of 2012 research about expanding the policy applications of value-added models into additional areas. Most notably, there were several analyses of principals’ impact on testing outcomes (see here and here), as well as a few important papers (here, here, and here, for example) about the potential for using these methods to estimate the effectiveness of teacher preparation programs.

The views expressed in this post do not necessarily reflect the views of the Albert Shanker Institute, its officers, board members, or any related entity or organization.