Last week I wrote about the push by school reformers to use student standardized test scores to evaluate teachers in a post that looked at Chicago Mayor Rahm Emanuel’s reforms and a New York Times editorial that called such assessment “sensible.” On Monday, The Times published another editorial about teacher evaluation. Both education historian Diane Ravitch and award-winning New York high school principal Carol Burris were perplexed by the editorial, which appeared to defend Race to the Top evaluations while acknowledging that they were, in fact, problematic. Read the editorial and, below, what Ravitch and Burris had to say about it on Ravitch’s blog.
Diane Ravitch noted: The New York Times’ editorial about teacher evaluation was unusually odd. It sounded as though the writer knows there is no evidence to support using student test scores, but is trying to find a rationale for doing it anyway. There is literally not a single district one can point to and say, “It’s working here. Here is proof that using test scores to evaluate teachers produces excellence.”
The editorial claimed that Montgomery County’s much-admired Peer Assistance and Review Program relies on test scores. It sounds like Cinderella’s ugly sister is trying to stuff her big foot into the glass slipper. Montgomery County turned down $12 million in Race to the Top funding to avoid using test scores to evaluate teachers. Its peer assistance program works far better than the value-added test-based evaluations now adopted in many states and districts in which test scores count for as much as 50% of a teacher’s “grade.”
The Times editorial is just one more beat on the same broken drum. It attempts to distance the Chicago plan from other evaluation plans, which, with the exception of Montgomery County’s, are more like Chicago’s than not. Montgomery County’s longstanding plan does not use test scores for evaluation, and it focuses on teacher improvement, not sorting and dismissal.
The column bases its arguments on the same false assumptions that folks like Michelle Rhee have sold to the public. The first is that teacher evaluation is universally broken. This assumption comes from the report The Widget Effect, produced by Rhee’s group, the New Teacher Project. It drew its conclusions from twelve selected districts. Evaluation is not broken in Montgomery County and it is certainly not broken at my high school. Many districts have sound evaluation systems that help teachers become more effective; they are not teacher dismissal machines but rather supervision models designed to improve instruction.
The second false assumption is that excellent teachers leave districts because they are not rewarded (translation: they do not receive merit pay). Again, there is no factual evidence to support this. Merit pay is neither effective nor desired by teachers; it is a gift of public funds at a time when schools can ill afford it.
The third false assumption is that as long as we decrease the percentage of the evaluation number derived from value-added measurement (VAM) scores, we can make it all work. The editorial uses IMPACT, the Washington, D.C., teacher evaluation system instituted by former schools chancellor Michelle Rhee, as an example. It attributes the decision by Rhee’s successor as chancellor, Kaya Henderson, to decrease the percentage of VAM in evaluations to “teacher anxiety.” I find that remark, which reformers often use to describe teacher responses to these systems, to be paternalistic at best and sexist at worst. Teachers object to VAM because they know its limitations and flaws. It was never designed to evaluate individual teachers; it was designed by researchers to be a tool to assess systems and programs. Using VAM to evaluate teachers is akin to using Lysol as a mouthwash because it does a good job killing germs on your kitchen counter.
Teachers and principals of grades 4-8 were recently assigned “growth scores” by the New York State Education Department. The model that the department used was a hybrid of a growth model and a value-added model. The American Institutes for Research (AIR), which created the model, also produced a technical manual to explain the resulting scores. You can find that manual here:
It is well worth a careful read. AIR was remarkably candid in explaining the model’s limitations. Here are some highlights:
* Although AIR preferred to use three years of prior scores as the baseline for growth, such data was available for Grades 6-8 only. Grades 4 and 5 had limited prior data, which was reflected in larger error, especially in Grade 4.
* There was no way to identify co-teachers or support teachers, and a little over half of all student scores in grades 4 and 5 were attributed to principals only, because they could not be correctly linked to teachers.
* The only covariates (predictor variables in the model) were English Language Learner status, Students With Disabilities status (with all disabilities, mild and severe, considered the same) and economic disadvantage.
* Race, ethnicity, class size, spending, attendance and a host of other variables that are known correlates of student performance were not included.
Perhaps the most important problems with the model are explained on pages 24-30. AIR clearly shows that as the percentage of students with disabilities and students in poverty in a class or school increases, the average teacher or principal growth score decreases. In short, the larger the share of such students, the more the teacher and principal are disadvantaged by the model. Regarding ELL students, the report indicates that some teachers are advantaged, while others are disadvantaged. This should come as no surprise: well-educated students from China and students from rural areas of El Salvador with interrupted education are both classified as ELL, but their growth, as measured by test scores, is quite different.
Likewise, in this model, teachers who have students whose prior test scores are higher are advantaged, while teachers whose students have lower prior achievement are disadvantaged. This phenomenon, known as peer effects, has been observed in the literature since the 1980s. It is a root cause of the widening of the test score gap among classes in tracked schools. It has also been found in school-to-school comparisons. In a study of Houston schools after Katrina, the schools that received a large share of high-performing students from New Orleans saw their original students’ scores rise, and those that received a large share of low-performing students from New Orleans saw their original students’ scores decrease.
Perhaps the best critique of the model comes from AIR itself. It concludes that “the model selected to estimate growth scores for New York State represents a first effort to produce fair and accurate estimates of individual teacher and principal effectiveness based on a limited set of data” (p. 35). Not “our best attempt,” not even a “good first attempt,” but rather a “first effort.” And yet, across the state, teachers and principals have received scores telling them that they are ineffective in producing student learning growth.
I can assure those who believe that teachers are simply anxious that their objections are not something that a Xanax will cure. Teachers and principals are smart and savvy. They are outraged, not anxious.
Follow The Answer Sheet every day by bookmarking www.washingtonpost.com/blogs/answer-sheet.