Education Secretary Arne Duncan (Bill O’Leary/The Washington Post)

On March 4, Education Secretary Arne Duncan appeared before the House Appropriations Subcommittee on Labor, Health and Human Services, Education, and Related Agencies to discuss the Obama administration’s 2016 budget request for the Education Department. After giving testimony, he answered questions from panel members, one of them being Connecticut Rep. Rosa DeLauro, the highest-ranking Democrat on the committee. At one point, the conversation turned to teacher evaluation, and that’s when Duncan said something that was odd given his department’s policies.

The discussion was about VAM, the value-added method of evaluation in which student standardized test scores are used to evaluate teachers (and sometimes principals and other adults in the school building). VAM purports to be able to take student standardized test scores, plug them into a complicated formula (of which there are many) and measure the “value” that a teacher supposedly adds to student learning. Supporters say these formulas can factor out things such as a student’s intelligence, whether the student is hungry, sick or subject to violence at home, or any other factor beyond the teacher’s input that could affect performance on a test. Many assessment experts say it is impossible for these formulas to do this.
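
For readers who want a concrete picture of the general idea, here is a deliberately oversimplified sketch, in Python, of the kind of covariate adjustment these formulas attempt. Every number and variable name in it is invented for illustration; no state’s actual model looks like this.

```python
# Toy illustration of the covariate-adjustment idea behind VAM formulas.
# All numbers and names here are hypothetical; real models are far more
# elaborate and vary by state and vendor.
import numpy as np

rng = np.random.default_rng(0)
n = 500

prior_score = rng.normal(50, 10, n)    # last year's test score
poverty     = rng.binomial(1, 0.4, n)  # stand-in for out-of-school factors
teacher_fx  = rng.normal(0, 2, n)      # "true" teacher contribution (unknowable in practice)
score = 5 + 0.9 * prior_score - 3 * poverty + teacher_fx + rng.normal(0, 8, n)

# Fit a linear model: current score ~ prior score + poverty.
X = np.column_stack([np.ones(n), prior_score, poverty])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

# The leftover residual (actual minus predicted) is what gets attributed
# to the teacher -- even though it still mixes teacher effect with noise.
residual = score - X @ beta
print("estimated coefficients:", beta.round(2))
print("correlation of residual with the true teacher effect:",
      round(float(np.corrcoef(residual, teacher_fx)[0, 1]), 2))
```

Even in this artificial setup, where the model is fed the very factors that generated the scores, the residual that stands in for the teacher’s “value” correlates only weakly with the true teacher effect, because it is mostly measurement noise.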

The method has been supported by the Obama administration and adopted as a part of teacher evaluations in most states — with varying weights put on the results — but for years researchers have said the results aren’t close to being accurate enough to use for decisions that matter, with a growing number of studies showing how unreliable these scores are.

Making matters worse is this: Because high-stakes tests are given only in math and English Language Arts, and because all teachers are supposed to be evaluated by VAM scores, most teachers wind up being evaluated in part on the test scores of students they don’t have and/or subjects they don’t teach (by, for example, using school-wide averages for some teachers, or linking a non-tested subject area, such as science, with a tested area, such as math, and evaluating science teachers on math scores). You can read here about an art teacher evaluated on English Language Arts scores.

[How students with top test scores actually hurt a teacher’s evaluation]

There’s something else to know about many VAM formulas: Teachers are actually evaluated against scores that have been “predicted” for each student, based on that student’s past performance on state-mandated tests and on the performance of other students. If the student exceeds the predicted score, the teacher is credited with “adding value,” but if the student falls short of the predicted score, the teacher is held responsible and that score counts against his/her evaluation. A worst-case scenario in this regard just happened to a Florida teacher, which you can read about here.
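
Here, again purely for illustration (the roster and scores below are made up), is what that actual-versus-predicted bookkeeping boils down to:

```python
# Hypothetical sketch of the "predicted score" mechanic described above:
# the teacher is credited or debited by how far each student lands from
# a prediction. Names and numbers are invented for illustration.
def value_added(students):
    """Mean of (actual - predicted) across a teacher's roster."""
    gaps = [actual - predicted for actual, predicted in students]
    return sum(gaps) / len(gaps)

# (actual score, predicted score) pairs for one class
roster = [(78, 74), (61, 66), (90, 88)]
print(round(value_added(roster), 2))   # 0.33: slightly "above prediction" overall
```

Notice that a student who already scores near the top of the test has little room to beat a high prediction, which is one way top-scoring students can end up dragging a teacher’s number down.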

Among the many reports about the problems with VAM was one a year ago from the American Statistical Association, the largest organization in the United States representing statisticians and related professionals (people who know a great deal about numbers), which said that value-added scores “do not directly measure potential teacher contributions toward other student outcomes” and that they “typically measure correlation, not causation.” It said that “effects — positive or negative — attributed to a teacher may actually be caused by other factors that are not captured in the model.”

The newly published edition of Educational Researcher is devoted to VAM and includes five articles that a) recognize that federal requirements for measuring student “growth” as part of test-based teacher evaluation rest on what amounts to value-added modeling, and b) provide examples of the problems with using VAM or growth models to evaluate teachers based on student test scores.

Stanford University Professor Linda Darling-Hammond, an authority on teacher preparation and evaluation, wrote in a related commentary in the same journal that, among other things, the five articles detail problems with the instability of VAM results “year to year, class to class, and test to test, as well as across statistical models,” and with how imprecise the value-added estimates are and how large the degree of error can be. (Darling-Hammond wrote: “As Sean Corcoran (2010) documented with New York City data, after taking statistical uncertainty into account, the ‘true’ effectiveness of a teacher ranked in the 43rd percentile on New York City’s Teacher Data Report might have a range of possible scores from the 15th to the 71st percentile, qualifying as ‘below average,’ ‘average,’ or close to ‘above average.’”)
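
You can reproduce the flavor of Corcoran’s point with a few lines of arithmetic. The figures below are illustrative stand-ins, not New York City’s actual data, but a standard error of roughly this size yields a spread close to the 15th-to-71st-percentile range quoted above:

```python
# Back-of-envelope version of the Corcoran point: map a VAM point estimate
# and its standard error into a range of percentile ranks. The numbers are
# illustrative assumptions, not New York City's actual figures.
from statistics import NormalDist

teachers = NormalDist(mu=0.0, sigma=1.0)  # assumed spread of teacher estimates
point_estimate = teachers.inv_cdf(0.43)   # a teacher ranked at the 43rd percentile
std_error = 0.4                           # sizable error, plausible for one year of data

low  = point_estimate - 1.96 * std_error  # 95% confidence interval
high = point_estimate + 1.96 * std_error
print(f"percentile range: {teachers.cdf(low):.0%} to {teachers.cdf(high):.0%}")
# -> percentile range: 17% to 73%
```

In other words, with error bars that large, the same point estimate is consistent with a teacher being well below average or well above it.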

She also noted that VAM models “appear particularly inaccurate for teachers whose students achieve below or above grade level” because of the peculiarities of how standardized tests used for accountability purposes in the United States today are designed.

All of this brings us back to March 4, when Duncan appeared before the House committee. Here’s how part of the conversation went, according to a transcript made from a videotape of those proceedings:

Rosa DeLauro: Are you prepared to rethink the federal requirement that VAM data be included in teacher evaluation scores for those states receiving a waiver from NCLB?

Arne Duncan: Your question is actually incorrect. We never say that you have to use value-added — we say that student learning and student growth need to be a part of that. … When we first came to Washington, I was stunned to learn that there were some states where it was against the law to link student learning to evaluating teachers. … The goal of great teaching is never just to teach, but to have students learn. … So to be clear, we always say “multiple measures”; there are a whole host of things that need to be there. We want to elevate and strengthen the teaching profession.

DeLauro: So there is no emphasis on testing or test scores?

Duncan: There is no requirement on value added. … What we are saying is that student learning, growth, needs to be a part of teacher evaluation. People are taking this to the extreme and saying either complete reliance on testing or zero accountability. …  I’m interested in student growth and gain, how much are students progressing each year. We want to look at how much students are improving each year.

DeLauro: Ranking teachers by VAM can have unintended consequences.

Duncan: So just to be very clear, we have never advocated ranking teachers by test scores.

Chairman Tom Cole: After spending three days watching Republicans argue with each other, it’s nice to see Democrats arguing.

Duncan: We are not arguing.

Whether or not they were arguing is beside the point. The point is that Duncan said that there is no federal requirement for value-added modeling. Well, any state that won Race to the Top money, and any state that applied for and won a federal No Child Left Behind waiver — and that’s most of the states — was required to measure student “growth” as part of teacher evaluation. And assessment experts say that the only way to get such a growth score is with value-added modeling.

While it is true that states did not have to apply for Race to the Top money or seek a waiver from the most onerous parts of the flawed No Child Left Behind law, budget problems and the impossible demands of NCLB made it hard for them not to do so.

As for Duncan’s statement that “people are taking this to the extreme and saying either complete reliance on testing or zero accountability,” well, that sort of misses the point that any reliance on a flawed evaluation system is unfair.

When Duncan said that “we want to look at how much students are improving each year,” he stated a laudable goal. Who doesn’t want to look at how much students are improving each year? The problem is with the method he has championed: using “growth” scores derived from a severely flawed methodology that experts say is unreliable and invalid.

When DeLauro said that “ranking teachers by VAM can have unintended consequences,” it gave Duncan the chance to say that “we have never advocated ranking teachers by test scores.” There is, of course, a distinction between ranking and rating, and teachers are rated rather than ranked. In this case, though, it seems to be a distinction without much of a difference; VAM is used for rating, and the problem is with VAM, not with whether teachers are being rated or ranked.

Back in 2014, after the American Statistical Association’s report and a few other reports warning about using VAM to evaluate teachers were published, I asked the Education Department whether Duncan had seen the research. I received this response in an e-mail from his press secretary, Dorie Nolt, reflecting Duncan’s position:

“Including measures of how well students are learning as part of multiple indicators of educator effectiveness is part of a set of long-needed changes that will improve classroom learning for kids. Growth measures are a significant improvement over the system that existed before, which failed to produce useful distinctions in teacher performance. Growth measures — including value-added measures — focus attention on student learning and show progress. While these measures are better than what existed before, educators will continue to improve them, and sharp, critical attention from the research community can help.”

She also added that “we keep track of all major research on this topic.”

Well, they may keep track of it, but listening to it is another story.

You may also be interested in:

How students with top test scores actually hurt a teacher’s evaluation

How is this fair? An art teacher is evaluated by student test scores in math

Statisticians slam popular teacher evaluation method