It has been seven years since the organization known as TNTP released a seminal study that showed that fewer than 1 percent of teachers were rated “unsatisfactory” on annual evaluations. The report accelerated a nationwide movement to overhaul teacher evaluations to more accurately reflect the range of teacher performance in U.S. classrooms.

So are the new evaluations — many of which incorporate test scores or other measures of student learning — any better at identifying poor teaching?

Not really, according to a new working paper by Matthew Kraft of Brown University and Allison Gilmour of Vanderbilt.

And that’s a problem both for those who believe evaluations should be used to help teachers improve and for those who believe they should be used to get rid of poor performers, Kraft said.

“Both avenues require evaluations that are both accurate and meaningfully differentiate among teachers,” he said. “We have to know who the struggling teachers are if we want to help them improve, and we have to know who they are if they continue to struggle, and they should be in another line of work.”

Kraft and Gilmour pulled together publicly available data on teacher performance ratings from 19 states with new evaluations and found that the median proportion of teachers deemed below proficient has ticked up from less than 1 percent to less than 3 percent. An improvement, they write, “but not a landmark change in ratings.”

They found wide variation among states, from Hawaii, where fewer than 1 percent of teachers were judged below proficient, to New Mexico, where 26 percent of teachers fell into that category.


Percentage of teachers rated below proficient across 19 states. Source: “Revisiting the Widget Effect” by Kraft and Gilmour.

There also was enormous variation at the top of the scale. Just 3 percent of teachers were rated above proficient in Georgia, for example, compared with 73 percent in Tennessee.


Percentage of teachers rated above proficient across 19 states. Source: “Revisiting the Widget Effect” by Kraft and Gilmour.

So why aren’t evaluations doing a better job of identifying the weakest teachers?

The researchers surveyed and interviewed 100 principals in an urban district that adopted new evaluations in 2012-2013, and what they learned was illuminating.

On average that first year, principals estimated that about 28 percent of teachers in their buildings were performing below proficient, but they predicted that they would assign low ratings to just 24 percent, openly acknowledging that they would inflate some teachers’ scores. At the year’s end, however, fewer than 7 percent of teachers actually received ratings below proficient.

Some principals felt uncomfortable delivering bad news to teachers. Others told the researchers that they didn’t have adequate time to deal with all the documentation and support that come with giving a teacher a poor rating, so they had to be judicious. As one middle school principal told the researchers:

“It’s not possible for an administrator to carry through on ten unsatisfactories simultaneously. I mean once somebody is identified as unsatisfactory, the amount of work, the amount of observation, the amount of time and attention that it requires to support them can become overwhelming.”

But sometimes principals gave inflated ratings because they didn’t want to discourage teachers who they believed had potential or were working hard to improve; because they did not have faith that they could hire a stronger replacement for the weak teacher already in the classroom; or because they decided it was less time-consuming to urge a teacher to find a job elsewhere than to go through the process of assigning and justifying a low rating.

To Kraft, the interviews show the difference between how policies play out in theory and on the ground, where evaluators are dealing with the constraints and challenges of real life. “There are totally rational reasons” why principals might inflate a teacher’s rating, he said, and policymakers should pay attention to those reasons as they’re thinking about how to improve evaluation systems.

For example, principals say that a low rating can sometimes create a wall that makes it more difficult for a teacher to hear and act on constructive criticism. Maybe that means that policymakers should consider separating observations meant to support teachers and help them improve from the high-stakes evaluation process, Kraft said.

Kraft said his team received no outside funding for the working paper. It has not been peer-reviewed, but two researchers — Dan Goldhaber at the University of Washington and Matthew Di Carlo of the Albert Shanker Institute, a nonprofit endowed by the American Federation of Teachers — agreed to read it at The Washington Post’s request.

Goldhaber said he was not surprised by the findings, given his own study of state evaluation systems, but found them “depressing.”

“It appears that there is little differentiation in spite of tremendous investments in performance evaluation systems that are supposed to have been more rigorous,” he said. “This is a problem of the politics around evaluation, not a problem that was or can be solved by technical changes to the way evaluations occur.”

Goldhaber said that anyone who cares about making sure that disadvantaged children have access to great teachers should be concerned about the findings. If weak teachers don’t have something in their formal file indicating that they are poor performers, he said, it makes it more difficult to see and address unevenness in teaching talent across schools.

“The fact that people can’t seem to be honest about teacher evaluations causes greater inequity,” he said.

Di Carlo said that the very low number of teachers receiving the lowest ratings is probably a sign of a need for adjustment, but also that policymakers should not react by trying to achieve some sort of ideal distribution of teachers across various categories. There is no one ideal distribution, he said, because teacher ratings are influenced by the incentives built into evaluation systems.

“For example, we might expect fewer teachers to receive low ratings in a system where receiving those ratings results in high stakes consequences, compared with a system in which the rating triggers a lower stakes result, such as professional development,” he wrote in an email. “Similarly, a state or district that awards large bonuses to their highest-rated teachers might design a system in which fewer teachers receive that highest rating, compared with a state or district in which no such concrete rewards are offered.”

For the same reason, he said, it can be misleading to directly compare the distributions of teachers in various states.

Di Carlo said that the interviews with principals were eye-opening, particularly principals’ acknowledgement that they expected to assign ratings that were at odds with their judgment of teachers’ actual performance. “To whatever degree this relatively small group of evaluators is representative of evaluators nationwide, this is troubling,” he wrote.

But Di Carlo said the interviews show that the reasons that principals sometimes give inflated ratings vary widely, and are in some cases — such as when a principal wants to avoid discouraging a young teacher — strategic. “It’s not just bureaucratic headaches, as is sometimes implied,” he wrote.