The Washington Post
Democracy Dies in Darkness

The key to evaluating teachers: Ask kids what they think

Thomas Kane is professor of education and economics at the Harvard Graduate School of Education and faculty director of the Center for Education Policy Research, which works with states and municipalities to evaluate educational policies. He recently partnered with the Bill and Melinda Gates Foundation on the "Measures of Effective Teaching" (or MET) project, which was intended to develop metrics capable of identifying which teachers are faring better than others, and to pin down what factors contribute to success.
He recently wrapped up a randomized study with MET that identified a number of factors associated with quality teaching. We spoke on the phone Feb. 1; a lightly edited transcript follows.

Dylan Matthews: Tell me a bit about how this study differs from the rest of the literature around standardized testing.

Thomas Kane: So for 40 years, we have known that when similar students enter different teachers’ classrooms, they come out with very different achievement. For 40 years we have designed our education policies as though that weren’t true. Very few of those differences had anything to do with teachers' paper credentials, yet that’s the only thing that state and local policies focused on. They only focused on paper credentials, and they didn’t systematically try to evaluate performance on the job for teachers.

The test scores, we knew, were just the most obvious manifestations of what is a difference in practice underneath, but nobody was systematically trying to find ways to measure those differences in practices. Quite the opposite. Most classroom observations were entirely perfunctory. Teachers, 98-plus percent of teachers, were given the same "satisfactory" rating, if their principal did an observation at all.

It was within that context that we said, "Let’s go out and try to identify some ways to identify effective teaching that help illuminate what’s going on with the difference in test scores." We want to know that these are at least related to the magnitude of gains that teachers provide. So let’s do that in a way where we could develop measures that could be implemented widely. That was one of the advantages of trying to start with such a large scale [3,000 teachers]. If we had tried to do it with 250 or 200 teachers, we’d have something you could do on a small scale, not a large scale.

You asked how was this different. Before 2007, there were two feuding camps in the teacher effects world. There were the outcome, or value-added only, group, that tended to focus just on outcomes and say, “Look, the effort to try to measure practice is just opinions, and subjective, and hopeless. So let’s focus on the outcome data.” And then there were the folks who tended to focus on practice. There was an organization called the National Board for Professional Teaching Standards, where a teacher could submit videos, but there was actually a hesitance to include student achievement in any of those measures, for many reasons, many of them ideological.

We tried to collect data from student surveys so that we might in the process bring together what had been very separate research. They were publishing in their own journals. To the extent that they were aware of what the others were doing, they were dismissive and critical of what others were doing. By creating this framework where we were using test score gains to validate practice-based measures, we were at least creating a common base for discussion.

Dylan Matthews: And your methodology was tailored in a lot of ways to address the concerns of potential critics in those camps.

Thomas Kane: So there were two things that we did specifically because we were taking the concerns of skeptics seriously, but seriously enough to test them. One thing that we did was a random assignment. That was explicitly set out because skeptics of value-added had correctly pointed out that you can control for students’ observable traits like their baseline test scores, but there are lots of other potential determinants of student achievement that are unobservable, and if students are being sorted to teachers based on those unobservable traits, you could be seeing teachers with exceptional students, not exceptional teachers. That’s why we did random assignment, because we knew that this was one of the major concerns of skeptics, that there were these unobservable traits.

We didn’t just look at student achievement gains on state tests, but on supplemental assessments as well. People could point out that teachers who are achieving gains on the state tests could just be teaching to the test, so you wouldn’t want to have an evaluation system that was being evaluated against teaching to the test, the practices associated with drill and kill as opposed to practices associated with student learning.

Rather than dismiss it, but also not to say, "You’re right, we shouldn’t do any of this assessment," we said, "Let’s find an assessment that measures some of these things that aren’t being captured in state tests." We used, for example, the Balanced Assessment in Mathematics. There’s very little time spent on whether kids understand math procedures. Give a kid a couple numbers, and ask them to add them — that’s just testing their procedural understanding. But instead, we give them open-ended word problems that test their understanding of math.

Dylan Matthews: So what big things did you learn?

Thomas Kane: I think what we showed was that if you combine data from three different sources, from classroom observation, a student survey and a teacher’s past history of achievement gains, controlling in the ways that school districts are now commonly controlling for them, by controlling for students’ baseline test scores, you can identify teachers who cause greater learning to happen, and I can use the word "cause" because we used random assignment.

The teachers who appear more effective will not only generate greater gains on the state tests that you’re measuring, but they’re also generating greater gains on the supplemental tests that we saw, that the state wouldn’t see. Their students are reporting subjectively that they enjoy being in those teachers’ classes. That’s relative to what we have now, which is nothing. That’s a big deal. Do I think these measures will get better? Yes, and we can talk about what I think will be the next round, where we need to get better. But relative to the information that we’re giving principals now for making personnel decisions now, and teachers who want to do better, it’d be a huge step forward to combine these three.

Dylan Matthews: How much effect did different amounts of time spent teaching have? A common criticism of successful charter schools like KIPP is that they work their teachers so hard that they're not tenable in the long term.

Thomas Kane: That’s an interesting point. We did not measure just the number of hours in the day that a teacher devotes. We know that their in-school time is equivalent, and all of our comparisons are within school. We didn’t measure the amount of out-of-school time that teachers spend grading schoolwork, maybe working with students outside of class, and it could be that the differences are there. So we didn’t have a direct test of that.

But our main goal was just to say, "What kinds of data could school districts be collecting either through classroom observations or student surveys to identify teachers who were having big impacts on kids?" I think, practically, if we learned that part of it was just that these teachers were working longer hours, that begs the question, well how would you measure it at a teacher level? If you get teachers to self-report the number of hours that they’re spending and you build that into their evaluations -- I never ask my son how much time he spends doing his homework. I read the homework. So I think that could be really important for telling us something about the length of the school day.

Dylan Matthews: Back to the three measures you think are important. My understanding is that classroom observation and student surveys — the two subjective measures — don't add much once you're already using value-added metrics. So why add them? Why not just use value-added?

Thomas Kane: Clearly one of the objectives is: Let’s find teachers who are most likely to generate gains and achievement on the state tests. And actually, if that’s your only objective, you’d put a ton of weight just on the teacher’s past history of promoting achievement on the state tests. Once you have a teacher’s past history of promoting achievement gains on the state tests, the classroom observations add relatively little.

But, two things that are a little different. The incremental value of the observations is different from asking if the results are related to achievement gains. Part of why they add little is that the part they add is redundant with achievement gains. But predicting future achievement gains is not the only objective. One thing you’d want to do is identify teachers who are more likely to promote achievement on these unmeasured skills. Another measure is reliability, and giving teachers feedback on specific practices that they may try to improve. Those other measures do better.

You want a measure that doesn’t bounce around extraordinarily from year to year. The student surveys are the least volatile from year to year and class to class. The reason is just the number of kids and the number of classroom days that they observe. We saw in the classroom observations that even if you take a trained observer and show them the same class, you’ll get differences. I know in D.C. they use outside peers, and our research suggests that’s a good idea. And most places aren’t doing that. Because there’s a judgment measure that you’d never want to eliminate, but you need to average over judges to get reliability.

You're getting two, three adults observing a class, but you might have 25 kids in elementary school classrooms, so the power of averaging really lends to the reliability of the student survey. For that criterion, the value-added scores are really not very helpful at all. But the student surveys and the classroom observations do point to things that a teacher could track his or her progress on. So because each different component -- the classroom observation, the surveys and the value-added gains -- excel at different objectives, that’s why it makes sense to combine them.
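The averaging argument above can be illustrated with a little arithmetic. The sketch below is not from the MET study. It assumes, purely for illustration, that each rating is an independent noisy reading with the same spread (the hypothetical `RATER_SD`), in which case the noise in an average shrinks with the square root of the number of raters. Real student ratings within a class are unlikely to be fully independent, so the actual gain would probably be smaller.

```python
import math


def standard_error_of_mean(rater_sd: float, n_raters: int) -> float:
    """Standard error of the average of n independent, equally noisy ratings."""
    if n_raters < 1:
        raise ValueError("need at least one rater")
    return rater_sd / math.sqrt(n_raters)


# Hypothetical spread of a single rating on an arbitrary scale (an assumption,
# not a figure from the interview or the study).
RATER_SD = 1.0

se_observers = standard_error_of_mean(RATER_SD, 3)   # three adult observers
se_students = standard_error_of_mean(RATER_SD, 25)   # 25 surveyed students

print(f"3 observers: standard error {se_observers:.3f}")
print(f"25 students: standard error {se_students:.3f}")
print(f"observer average is {se_observers / se_students:.2f}x noisier")
```

Under these assumptions the average of three observers is a little under three times as noisy as the average of 25 students, which is the sense in which a larger pool of raters helps reliability.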

Dylan Matthews: The findings on outside observers who watch videotapes versus observe in person are particularly interesting.

Thomas Kane: It’s interesting, we did not have people walk into the classrooms. We had video cameras, and then we had people watch the videos and score them. A couple of things, though, that we learned from that. There was this one study, one of the reports we released was a study focused on Hillsborough, and so teachers could pick which four videos their administrator would see, but they’d collected 25 videos last year in Hillsborough, so we could have other observers, other than administrators, watch their videos so we could say, "How did their performance differ on days they show their principal versus days they didn’t want to show their principal?"

While the mean score was higher on the days that the teachers chose to submit, once you corrected for measurement error, a teacher’s score on their chosen videos and on their unchosen videos were correlated at 1. They were perfectly correlated. The people who struggled on the lessons they’re willing to submit are also the people struggling on the lessons they didn’t submit. The best lesson from the best teacher is that much higher than the best lesson from the worst teacher. The order is preserved even if the mean rises.
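Two ideas in the passage above can be sketched in a few lines: a uniform rise in scores leaves the ordering (and the correlation) untouched, and an observed correlation can be adjusted for noisy measurement. The interview does not say which correction the researchers applied. The sketch assumes Spearman's classic correction for attenuation, and all scores and reliabilities in it are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def disattenuate(observed_r, reliability_x, reliability_y):
    """Spearman's correction: divide by the root of the two reliabilities."""
    return observed_r / math.sqrt(reliability_x * reliability_y)


# Hypothetical noise-free scores for five teachers (not study data).
unchosen = [1.8, 2.1, 2.6, 3.0, 3.4]
# Chosen lessons score half a point higher for everyone: the mean rises,
# but the ordering of teachers is unchanged.
chosen = [score + 0.5 for score in unchosen]
r_shifted = pearson(unchosen, chosen)

# Hypothetical noisy case: an observed correlation of 0.6 between two
# measures that each have reliability 0.6.
r_corrected = disattenuate(0.6, 0.6, 0.6)

print(f"correlation after a uniform shift: {r_shifted:.3f}")
print(f"corrected correlation: {r_corrected:.3f}")
```

In this made-up case both numbers come out at 1: the shift changes the mean but not the ranking, and the modest observed correlation is fully accounted for by measurement noise.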

That has huge implications, because it means that the element of surprise may not be that important. That contributes a huge degree of anxiety, to have observers be able to pop in whenever they like. Give them a camera and say, "Submit four to five lessons you’re particularly proud of." I think that would remove some of the anxiety that made this hard, and in the process would have all sorts of other benefits. It would allow principals to time-shift. It would make it easier to get people outside the school involved in education. D.C. spends a lot to get those master educators to drive around to schools. If you could do this video-based thing, and still have them sit down with a teacher one on one to discuss these three or four lessons they submitted, rather than go out there physically, I just think it’d be a more efficient way to do it.

Hundreds of thousands of observations, maybe not millions, will be done with digital video rather than in person. This is just anecdotal evidence, but many teachers told us that having the camera in the room was a lot less distracting than having an adult in the room. You’re trying to read their body language, see if they’re bored, but the camera quickly disappears from people's consciousness.

Dylan Matthews: Now, the three components you measure predict future performance on achievement tests. But a lot of people dismiss that, even though there's growing evidence that achievement test scores are correlated with all kinds of important real-life outcomes. Why do these scores matter?

Thomas Kane: So I’m really heartened — even asking that question means you’re aware of that Chetty/Friedman/Rockoff study. That’s the main one that gives me hope. Yeah, that was in a not experimental context, but they showed not only — some of the reporting on that has been misleading. In fact, in last year’s State of the Union speech, the president said, "We know teachers have impacts on earnings later."

Well, that’s true, but actually if that was all Raj and Jonah and John had found, that wouldn’t have been all that useful. Because gosh. If you have to wait 20 years to find out who's a good teacher, you could send the teachers flowers but it’s hard to do anything in terms of policy. But that’s not what they found. What they found was that the teachers who appeared to have high value-added while the students were in their classrooms, their value-added was predictive of students’ income later. I’m optimistic about it but based largely on Raj’s study.