A valid way to use ‘value added’ in teacher evaluation

The debate over whether and how to use value-added models in teacher evaluation — models that use student standardized test scores to assess teachers — only grows more controversial as time goes on. Here is a new approach to the subject from Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, “Value-Added Measures in Education,” provides an accessible review of the technical and practical issues surrounding these models. This piece appeared on the blog of the non-profit Albert Shanker Institute.

By Douglas N. Harris

Now that the election is over, the Obama administration and policymakers nationally can return to governing.  Of all the education-related decisions that have to be made, the future of teacher evaluation has to be front and center.

In particular, how should “value-added” measures be used in teacher evaluation? President Obama’s Race to the Top initiative expanded the use of these measures, which attempt to identify how much each teacher contributes to student test scores. In doing so, the initiative embraced and expanded the controversial reliance on standardized tests that started under President Bush’s No Child Left Behind.

In many respects, the Race was well designed. It addressed an important problem – the vast majority of teachers report receiving little high-quality feedback on their instruction. As a competitive grants program, it was voluntary for states to participate (though effectively involuntary for many districts within those states). The administration also smartly embraced the idea of multiple measures of teacher performance.

But they also made one decision that I think was a mistake.  They encouraged—or required, depending on your vantage point—states to lump value-added or other growth model estimates together with other measures. The raging debate since then has been over what percentage of teachers’ final ratings should be given to value-added versus the other measures. I believe there is a better way to approach this issue, one that focuses on teacher evaluations not as a measure, but rather as a process.

The idea of combining the measures has some advantages.  For example, as I wrote in my book about value-added measures, combined measures have greater reliability and probably better validity as well.  But there is also one major issue: Teachers by and large do not like or trust value-added measures. There are some good reasons for this: The measures are not very reliable and therefore bounce around from year to year in ways that have nothing to do with actual performance. There is also ongoing debate about whether the measures, in any given year, provide useful information about “true” teacher performance (i.e., whether they are valid).

The larger problem is that policymakers have tended to look at teacher evaluation like measurement experts rather than school leaders. Measurement experts naturally want valid and reliable measures—ones that accurately capture teacher effectiveness. School leaders, on the other hand, can and should be more concerned with whether the entire process leads to valid and reliable conclusions about teacher effectiveness. The process includes measures, but also clear steps, checks and balances, and opportunities to identify and fix evaluation mistakes. It is that process, perhaps as much as the measures themselves, that instills trust in the system among educators. But the idea of combining multiple measures has short-circuited discussion about how the multiple measures—and especially value-added—could be used to create a better process.

One possible process comes from the medical profession. It is common for doctors to “screen” for major diseases using procedures that flag everyone who has the disease, along with some people who do not (the latter being false positives). Those who test positive on the screening test are then given a “gold standard” test that is more expensive but almost perfectly accurate.  Doctors do not average the screening test together with the gold standard test to create a combined index. Instead, the two pieces are considered in sequence.

Ineffective teachers could be identified the same way.

Value-added measures could become the educational equivalent of screening tests. They are generally inexpensive and somewhat inaccurate. As in medicine, a low value-added score, combined with some additional information, should lead us to conduct additional classroom observations to identify truly low-performing teachers and to provide feedback that helps those teachers improve. If, after continued observation over a reasonable period, all else fails, administrators could counsel the teacher out or pursue a formal dismissal procedure.

The most obvious problem with this approach is that value-added measures, unlike the medical screening tests, do not capture all potential low-performers.  They are statistically noisy, for example, and so many low-performers will get high scores by chance.  For this reason, value-added would not be the sole screener.  Instead, some other measure could also be used as a screener.  If teachers failed on either measure, then that would be a reason for collecting additional information. (This approach also solves another problem discussed later.)
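For readers who think in terms of decision rules, the two-screener logic described above can be sketched in a few lines of code. This is only an illustration: the score scales, cutoff values, and function name below are all invented for the sketch, not drawn from any actual evaluation system.

```python
# Hypothetical sketch of the "trigger" logic: a teacher is flagged for a
# fuller review if EITHER screener falls below its cutoff, so a noisy
# value-added score alone cannot clear (or condemn) anyone.
# All thresholds and scales here are invented for illustration.

def needs_full_review(value_added_score, observation_score,
                      va_cutoff=-0.5, obs_cutoff=2.0):
    """Return True if either screening measure falls below its cutoff,
    triggering additional classroom observation."""
    return value_added_score < va_cutoff or observation_score < obs_cutoff

# A teacher with a high value-added score can still be flagged by the
# second screener, and vice versa; passing both screens means no trigger.
print(needs_full_review(0.3, 1.5))   # low observation score -> True
print(needs_full_review(-1.2, 3.0))  # low value-added score -> True
print(needs_full_review(0.3, 3.0))   # passes both screens -> False
```

The key design point, mirroring the argument above, is the "or": because value-added is noisy, failing either screener (not some weighted average of the two) is what prompts the closer look.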

There is a second way in which value-added could be used as a screener – not of teachers, but of their teacher evaluators. To explain how, I need to say more about the “other” measures in an evaluation system. Almost every school system that has moved to alternative teacher evaluations has chosen to also use classroom observations by peers, master teachers, and/or school principals. The Danielson Framework, PLATO, and others are now household names among educators. Classroom observations have many advantages: They allow the observer to take account of the local context. They yield information that is more useful to teachers for improving practice.  And we can increase their reliability by observing teachers more often.

The difficulty is that these measures, too, have validity and reliability issues.  Two observers can look at the same classroom and see different things.  That problem is more likely when the observers vary in their training. Also, some observers might know teachers’ value-added scores and let those color their views during the observations – they might think, “I already know this teacher is not very good so I will give her a low score.”

Value-added measures might actually be used to fix these problems with classroom observations. To see how, note that researchers have found consistent, positive correlations between value-added and classroom observation scores. These correlations are far from perfect (mainly because of statistical noise), but they provide a benchmark against which we can compare (validate, if you will) the scores across individual observers.  Inaccurate classroom observation scores would likely show up as low correlations with value-added. Conversely, if observers were letting value-added scores influence their ratings, the correlations could be unusually high, which would also be a red flag.
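The benchmark idea above amounts to a simple two-sided check on each observer's correlation with value-added. Here is a minimal sketch, assuming entirely invented correlation bands (0.1 and 0.7 below); real cutoffs would have to come from the empirical research on how strongly the two measures typically correlate.

```python
# Hypothetical sketch of screening the OBSERVERS: compare each observer's
# ratings to teachers' value-added scores and flag correlations that are
# suspiciously low (inaccuracy) or suspiciously high (contamination).
# The band limits (0.1, 0.7) are invented for illustration.
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def observer_flag(observation_scores, value_added_scores,
                  low=0.1, high=0.7):
    """Flag an observer whose ratings track value-added too weakly
    (possible inaccuracy) or too strongly (ratings may be colored by
    knowledge of the value-added scores)."""
    r = pearson(observation_scores, value_added_scores)
    if r < low:
        return "too low: check accuracy"
    if r > high:
        return "too high: check for contamination"
    return "in range"
```

An observer flagged either way would then get the "additional observer" check described below; the point is that the benchmark works in both directions.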

In these cases, an additional observer might be used to make sure the information is accurate. In other words, value-added can screen the performance of not only teachers, but observers as well. Used in these ways, value-added would be a key part of the system but without being the determining factor in personnel decisions.

This screening approach would solve a host of problems.

  1. The screening approach maintains the new and important focus on teacher evaluation and the use of student test scores in those systems.  The National Education Association and the American Federation of Teachers themselves have been rightly critical of traditional-style evaluation systems because they provide so little useful feedback to teachers. Screening with value-added places the emphasis on formative, feedback-based measures such as observations.
  2. The screening approach represents a “feedback loop” in which both value-added and observations are used to ensure that the other is functioning well – i.e., observations are used to verify the identification of low-performing teachers based on value-added (and help them improve), while value-added is used to identify observers whose performance may be lacking.  All measures have their flaws and value-added can help address these.
  3. The screening approach ensures that value-added measures are never the primary determinants of high-stakes personnel decisions. Rather, in this alternative proposal, value-added would only serve to trigger a closer look at a teacher’s performance, but the actual decisions would be based on classroom observations by experts. These have much greater support among teachers and provide more useful feedback.
  4. The screening approach helps schools focus their evaluation resources where they count: On low-performing teachers and low-performing classroom observers. This is crucial in these tough economic and fiscal times, during which schools must allocate resources carefully.
  5. The screening approach can be applied to all teachers, not just those in tested grades and subjects. A common criticism of value-added is that it cannot be applied to all teachers. With the approach I am proposing, only the initial screening process would differ (e.g., a single classroom observation that all teachers would receive) and the remainder of the process could be based on a more standard set of measures (additional classroom observations).
  6. The screening approach, because it works in all grades and subjects, avoids the unfortunate response, seen in states such as Florida, of expanding testing to every grade and subject. Teaching to the test is a real problem, and expanding testing would only make it worse. Value-added could still serve to check screeners even in non-tested grades and subjects, as long as those same screeners also observe some teachers in tested classrooms.
  7. The screening approach ensures that there is enough information that educational leaders will be able to sleep at night knowing they are making the best possible personnel decisions – that their tough choices will not be overturned by lawsuits alleging arbitrary and capricious firings.

Since I started with a medical analogy, some might want to call this a “triage” approach. The term fits in some ways but not in others.  In both cases, the focus is on allocating resources cost-effectively: higher-performing teachers get less attention, just as healthier patients do.  But there is a difference, because medical triage entails devoting few resources to those least likely to make it. Here, by contrast, part of the point is to collect more information on struggling teachers so that personnel decisions can be made with confidence and in keeping with legal requirements.

The screening approach certainly wouldn’t solve all the problems with the new teacher evaluation systems. The choice of additional measures beyond value-added, and the implementation of these measures, are critical. So are the ways in which the evaluations are used in personnel decisions.

Value-added measures have played a valuable role in sparking this important debate, but they need not do all the heavy lifting for our reformed teacher evaluation systems. We need not just a number, but a process for identifying low-performing teachers and helping them get better.
