The Washington Post

How a fabulous principal lost her job — and more damage the misuse of test scores has caused


In the post above this, I did a Q & A with Daniel Koretz, a professor of education at Harvard University and an educational assessment and testing policy expert who wrote the newly published book “The Testing Charade: Pretending to Make Schools Better.” This is an excerpt of the book, and I am publishing it with permission:


Taking Test Scores Out of Context
Our test-based accountability system takes test scores out of context. That was a deliberate goal of the reformers; they wanted measures that someone sitting in a state capital could interpret without ever looking at the school from which they were obtained.
However, that’s one of the main reasons the reforms have failed. Test scores taken out of context often don’t tell you what you need to know.
Consider the case of Joyce Irvine, who from 2004 until 2010 was the principal of Wheeler Elementary School in Burlington, Vermont. Wheeler serves a highly disadvantaged population; in 2010 thirty-seven of the thirty-nine fifth-grade students were either refugees or special-education students. During her time at Wheeler, Irvine added a number of enrichment programs, including a summer school, and converted the school into an arts magnet. She worked very hard—often eighty hours per week—and both her hard work and her success were recognized by her colleagues and her superiors. Her final evaluation began, “Joyce has successfully completed a phenomenal year,” and the superintendent called her “a leader among her colleagues.”
In 2010 she was fired from her job as principal and assigned to a lower-paying administrative position. The reason: low test scores. Under one of Arne Duncan’s policies, to qualify for funds from the federal economic stimulus program, the district had to replace the school with a charter school (the state had none), remove the principal and half the staff, or remove the principal and “transform” the school. Irvine had to go. As she said, “Joyce Irvine versus millions. You can buy a lot of help for children with that money.”
Was it reasonable to be concerned because Wheeler’s scores were low? Of course. Signaling potential problems is one of the most important functions tests can serve. Was it reasonable to assume that Wheeler’s low scores reflected a lack of competence or effort on the part of its principal? Of course not. The fundamental mistake this illustrates is taking scores out of context. The system didn’t require that the district consider whether the school produced a reasonable level of achievement or had produced an acceptable rate of improvement given the circumstances it faced. In fact, it made that sort of judgment irrelevant. Poor scores are taken to be sufficient to indicate that the school’s staff is failing, and for the most part the circumstances confronting the school can’t be taken into account. It’s much like insisting that doctors treat patients based on one symptom without considering which of its many possible causes is the relevant one. Because the quality of education is confounded with context, good educators teaching under difficult circumstances are punished. By the same token, weak teachers assigned to schools with high-achieving students get a pass.
Trying to Use Tests to Explain, Not Just Describe
The Wheeler School example illustrates another problem with using test scores to evaluate educators: mistakenly attributing scores—high or low—to the actions of educators, despite the many other factors that influence student achievement. Used properly—in particular, used in ways that limit score inflation—tests are very useful for describing what students know. On their own, however, tests simply aren’t sufficient to explain why they know it. This was explained clearly by the early designers of standardized tests, but their warnings are no longer heeded.
Of course the actions of educators do affect scores, but so do many other factors both inside and outside of school, such as students' parents' education. This has been well documented at least since the publication more than fifty years ago of the "Coleman Report," Equality of Educational Opportunity, a huge study commissioned by the US Office of Education, which found that student background and parental education had a bigger impact than schooling on student achievement.4 While the debate about the precise relative contributions of schooling and background still rages in the research world, there is no doubt that factors other than schooling are enormously important. This is part of the explanation for the huge variation in student performance we find anywhere we look. And, of course, it is one of the many reasons that the one-size-fits-all approach to reform has backfired so badly.
Test scores would justify a conclusion about the effectiveness of educators only if one could somehow separate the impact of the other factors that influence scores. One can do that—albeit not completely—either by observing the school or by means of some statistical techniques (or both). That brings me to “value-added modeling.”
Using “Value-Added Modeling” to Evaluate Teachers
Even though some parts of our accountability system totally ignore factors other than the actions of educators that influence scores—the policy that forced Burlington to can Joyce Irvine is an example—policy makers haven’t been entirely blind to this issue.
Their primary response has been to rely on various types of "value-added modeling," frequently dubbed VAM. Other than VAM, most of the test-based accountability systems reformers have imposed have taken one of two approaches. One, the NCLB approach, simply compares the performance of one cohort of kids—say this year's fourth-graders—to an arbitrary standard, the "Proficient" cut score. Each year every school has a target, a percentage of kids in that cohort who should reach or exceed the Proficient standard. The second approach, exemplified by the Kentucky system I described earlier, bases accountability on the change between successive cohorts. If the percentage proficient in this year's fourth-graders exceeds last year's percentage (or the percentage two years previously, in Kentucky's case) as much as your arbitrary target demands, you're golden. If the percentage proficient doesn't improve enough, you may be punished.
VAMs are an entirely different approach. The original idea behind VAM was that we can track the scores of individual students over time to estimate how much each one improved in a given subject. Using earlier scores, we can predict how each student is likely to score at the end of his or her current grade. Sometimes these predictions are based solely on students' earlier scores, while in other systems they take into account some background factors as well. Schools don't have a lot of background information about students, but the variables included in the VAM model can include gender, receipt of free or reduced-price lunch, limited proficiency in English, or disability status. Either way, each student's deviation from her predicted score for the current grade is assumed to measure the impact of a teacher's work. The estimate of a teacher's value added is obtained by adding these deviations from prediction for all her students.
If a teacher’s students do better than predicted, that is taken to show that she is effective, but if they do worse, she is ineffective. Hence these deviations are taken to indicate the “value” the teacher has added to his students’ trajectories.
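The arithmetic behind this idea can be sketched in a few lines of code. This is only an illustration, with invented scores and the simplest possible prediction (a straight-line fit on last year's scores); operational VAM systems use far more elaborate statistical models.

```python
# Toy value-added calculation: predict this year's score from last
# year's, then average each teacher's students' deviations from
# prediction. All scores are invented for illustration.

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# (last year's score, this year's score, teacher)
students = [
    (60, 66, "A"), (70, 74, "A"), (80, 86, "A"),
    (60, 61, "B"), (70, 69, "B"), (80, 79, "B"),
]

slope, intercept = linear_fit([s[0] for s in students],
                              [s[1] for s in students])

# Each teacher's "value added" is the mean deviation of her students'
# actual scores from their predicted scores.
value_added = {}
for prior, current, teacher in students:
    deviation = current - (slope * prior + intercept)
    value_added.setdefault(teacher, []).append(deviation)

for teacher, devs in sorted(value_added.items()):
    print(teacher, round(sum(devs) / len(devs), 2))  # A 2.83, B -2.83
```

Teacher A's students beat their predictions and teacher B's fell short by the same margin, so A is labeled "effective" and B "ineffective" — but nothing in the calculation says *why* the deviations occurred.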
While this seems conceptually simple, it is devilishly hard in practice, raising very difficult problems of both measurement and statistical modeling. Moreover, while VAM began with this approach, it has spawned many others, some quite different from this first one, and the differences among them are technically complex. For simplicity, I’ll stick with this original approach.
The movement to use VAMs had two major motivations. The first is that it is simply more logical to hold teachers or schools accountable for how much students have improved over the year they taught them, rather than for the achievement they have accumulated over their entire lives to that date. If a student starts at a new school in fifth grade but came in at a second-grade reading level, it’s not fair to punish the new school for his low test scores, and by the same token, it’s not fair to credit a teacher for the high scores of a student who enters school that year already performing very well. The second motivation is that it helps separate the effects of teaching from all of the other things that influence test scores.
However, while VAMs help to separate the impact of teaching from everything else, they don't solve the problem entirely. The explanation is technical. In the typical VAM approach we start by predicting students' scores with a model that includes earlier scores and perhaps background factors. The estimate of VAM is obtained by adding up the discrepancies from those predictions—that is, by looking at variations in performance that we have not predicted using the variables in the model. This variation that we can't predict could be a result of anything that isn't included in the statistical model—including, but by no means limited to, the impact of the teacher. Using VAMs to evaluate teachers requires that one assume that this unpredicted variation among teachers is attributable to teachers, but that needn't be entirely true. It might be, for example, that the teacher was given a particularly challenging or particularly easy cohort of kids this year. Or that there were a lot of illnesses that disrupted the class. Or that a new principal arrived, and order either improved or deteriorated as a result. Or that curricular changes altered the fit of the curriculum to the test.
The list of other things that can contribute is endless. In 2014 the American Statistical Association (ASA), the primary professional organization of statisticians in the United States, issued the ASA Statement on Using Value-Added Models for Educational Assessment. The ASA summary of this point was straightforward: “VAMs typically measure correlation, not causation: Effects—positive or negative—attributed to a teacher may actually be caused by other factors that are not captured in the model.”5 That is, if a VAM is used to estimate your “effectiveness” as a teacher, that estimate will sometimes blame or credit you for things that have nothing to do with your teaching.
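The ASA's point is easy to demonstrate with a toy simulation. In the sketch below (invented numbers, not a model of any real system), every teacher has exactly zero true effect on scores, yet shared classroom-level shocks still leave each one with a nonzero "value-added" estimate.

```python
import random

random.seed(1)

# Every teacher in this simulation has zero true effect on scores.
# A classroom-level shock (a flu outbreak, an unusually easy or
# difficult cohort) is shared by all 25 students in the class.
estimates = []
for teacher in range(8):
    classroom_shock = random.gauss(0, 3)
    deviations = [classroom_shock + random.gauss(0, 10)  # per-student noise
                  for _ in range(25)]
    estimates.append(sum(deviations) / len(deviations))

# The spread across teachers looks like differences in "effectiveness,"
# even though no teacher influenced scores at all.
print([round(e, 1) for e in estimates])
```

Run it and the eight estimates fan out above and below zero; an accountability system reading them as teacher quality would reward some of these identical teachers and punish others.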
So can’t we solve this by including students’ background characteristics in the model, as some systems do, to take their effects out of the remaining variation in scores? Not entirely. First, the factors that obscure the effects of teaching are not limited to students’ backgrounds; they also include other attributes of a school and community, such as school size, the amount of extracurricular tutoring students receive, the characteristics of students’ peers, teachers’ colleagues, and the school administration. Second, we generally have only very limited information about students’ backgrounds. Finally, controlling for students’ background can ironically do more harm than good. We know that in many settings, advantaged kids get better teachers (as mine did when I insisted they have Norka Padilla as their math teacher). When this is the case, we would want the evaluation system to pick it up. However, controlling for students’ background will also remove the differences in teacher quality that are associated with them.
This is not to say that VAM entirely fails to reflect the impact of teachers’ work. Done well—and if score inflation is held to a minimum—it does. The problem is twofold. Like any other metric based on test scores, it leaves a great deal unmeasured. And for the portion it does measure, it doesn’t dependably separate the effects of a teacher’s work from many other influences, both within the school and outside of it. And, of course, it only works at all in subjects that students study year after year. You can’t measure growth from year to year in subjects like chemistry that most students study only for one year.
Rating Teachers with the Wrong Test
Suppose you and I both teach science in the same grade and we are equally effective teachers. I am asked to teach a class that focuses on topics A, B, C, and D. You are asked to focus your instruction on topics C, D, E, and F. At the end of the year, our students are given a test that focuses on A, B, C, and D.
You lose. Your students’ test scores will appear to show that you’re “ineffective,” even though in fact we are equally good, because you didn’t teach A and B. No matter how important E and F were for your students—even if they were the core of what you were expected to teach—the time you put into them was simply wasted for purposes of your evaluation.
My example is contrived, but it's not far-fetched. This sort of thing happens frequently, although often not in so extreme a fashion. It often affects educators who teach material that is advanced for their grade level, and it can be particularly severe when VAMs are used to evaluate educators. A good example appeared in a 2012 blog post by Aaron Pallas, a sociologist at Columbia University, titled "Meet the 'Worst' 8th Grade Math Teacher in NYC." He described the case of Carolyn Abbott, a teacher at the Anderson School, who was literally rated as the worst eighth-grade math teacher in New York City, which used VAMs to evaluate teachers. To put this in some context: when Abbott taught the cohort of kids at issue in the seventh grade, they scored at the ninety-eighth percentile. The model therefore predicted that they would score nearly as well in the eighth grade, at the ninety-seventh percentile, but instead they scored at the eighty-ninth. As I explained, deviations from predicted scores are interpreted as "effectiveness," and the difference between the ninety-seventh and eighty-ninth percentiles put Abbott at the bottom of the district's rankings. This put her application for tenure in jeopardy.
The reason was poor alignment of the test with the material Abbott was supposed to teach. Anderson is an unusually advanced school, and much of its teaching is literally years above grade level. Pallas noted that much of the content assessed in the eighth-grade test is taught to Anderson students in the fifth or sixth grade. Abbott explained that she didn’t teach the curriculum her eighth-graders were tested on. Instead, she primarily taught the more advanced algebra that shows up on the state’s high-school Regents Integrated Algebra test. Because she was evaluated using the wrong test—and because scores are taken out of context—she couldn’t be “effective” no matter how well she taught.

Reprinted with permission from “The Testing Charade: Pretending to Make Schools Better,” by Daniel Koretz, published by the University of Chicago Press. © 2017 The University of Chicago. All rights reserved.