The teachers' union in Houston sued its school district this spring after the district started evaluating teachers based on students' test scores. Such evaluations are becoming more common around the country despite the objections of teachers, who say the evaluations are unfair and misleading.

Supporters of these evaluations often point to influential research showing a connection among teachers, test scores and students' success in adulthood -- research that has been cited by President Obama and used in recent cases challenging teacher tenure.

Now, that finding has come under renewed criticism in a new working paper from Jesse Rothstein, an economist at the University of California, Berkeley, who has long argued that students' standardized test scores might not be a consistently reliable measure of teacher competence. Rothstein argues that existing research -- and the politicians and school board officials who rely on it -- may be underestimating how often teachers who are seen as particularly skilled are actually just assigned good students, or seek them out, rather than making a difference in test scores themselves.

The leading recent study underpinning the idea that teachers can make a difference, measurable by standardized test scores, was conducted in two parts by economist Raj Chetty of Harvard University and two colleagues. The group found that if a teacher's students performed well on tests, that teacher's students also earned more money later in life. The authors based this conclusion on reams of historical data, including test scores and tax returns for millions of people who attended grade school in an urban school district that they did not identify.

Obama referred to this study in 2012 when he said, "We know a good teacher can increase the lifetime income of a classroom by over $250,000." To be precise, assigning the teacher with the highest scores out of a group of 20 teachers to a classroom full of students is like investing $250,000 and paying the dividends back to those students over the course of their careers.

Of course, this information is not especially useful if certain teachers' students do better on tests because those teachers tend to get the best kids in their classrooms.

For example, one teacher might be recognized by his colleagues as particularly good with students who don't speak English, while another who isn't comfortable disciplining students might be assigned the obedient children who enjoy school. The second teacher's students will have better test scores and might make more money in the long run than the first's, but not because the second teacher is more competent.

Chetty and his colleagues, Columbia University's Jonah Rockoff and Harvard's John Friedman, explored this possibility in their original paper. They examined the average test score for an entire grade at a particular school, reasoning that if certain teachers were getting all the good students, then the other teachers' scores would have decreased, and the average would remain the same.

They found that when a teacher whose students did well on tests moved to a different school, for example, the average score across that teacher's grade at the new school did improve, indicating that it wasn't just a matter of schools assigning the best students to particular teachers.
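The logic of that grade-level check can be illustrated with a toy calculation. The sketch below is not the researchers' actual model. It assumes, purely for illustration, that each student's score is the student's own ability plus a fixed boost from the teacher, and all of the names and numbers are hypothetical. Under that assumption, shuffling students between classrooms changes each classroom's average but not the grade's, while a genuinely better teacher lifts the grade average.

```python
from statistics import mean

# Toy assumption (not from the study): score = student ability + teacher effect.

def classroom_means(abilities, assignment, effects):
    """Average score in each teacher's classroom."""
    rooms = {}
    for ability, teacher in zip(abilities, assignment):
        rooms.setdefault(teacher, []).append(ability + effects[teacher])
    return {teacher: mean(scores) for teacher, scores in rooms.items()}

def grade_mean(abilities, assignment, effects):
    """Average score across every student in the grade."""
    return mean(a + effects[t] for a, t in zip(abilities, assignment))

# Hypothetical grade of six students and two teachers, "A" and "B".
abilities = [40, 50, 60, 70, 80, 90]
sorted_assignment = ["A", "A", "A", "B", "B", "B"]  # B gets the top students
mixed_assignment = ["A", "B", "A", "B", "A", "B"]   # students are spread out

no_effect = {"A": 0, "B": 0}    # neither teacher changes scores
real_effect = {"A": 0, "B": 6}  # B genuinely adds six points

# Pure sorting: B's classroom looks 30 points better, yet the grade
# average is the same whether students are sorted or mixed.
sorted_rooms = classroom_means(abilities, sorted_assignment, no_effect)
grade_sorted = grade_mean(abilities, sorted_assignment, no_effect)
grade_mixed = grade_mean(abilities, mixed_assignment, no_effect)

# Real effect: the grade average itself moves up.
grade_real = grade_mean(abilities, mixed_assignment, real_effect)

print(sorted_rooms, grade_sorted, grade_mixed, grade_real)
```

In this toy grade, a classroom-level comparison cannot tell the two stories apart, but the grade-level average can, which is the reasoning the paragraph above describes.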

Rothstein wasn't convinced. Looking at a similar set of data on test scores and incomes from North Carolina, he observed the same correlation between students' teachers, their test scores and how well they turn out later in life.

Yet he found that scores seemed to be changing throughout the school in some cases, suggesting that certain teachers were looking for an opportunity to teach better students, maybe by moving to schools in gentrifying neighborhoods.

When a teacher whose students do well on tests moves to a school where test scores were improving the previous year, and average scores continue improving after that teacher arrives, it is hard to know how much of that continued improvement is due to the new teacher and how much to other factors.
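A small arithmetic sketch shows why a pre-existing trend muddies the picture. It is not the method used by either Rothstein or Chetty's group. The function names, years and scores are all hypothetical, and subtracting last year's gain is only one crude way to account for a trend.

```python
def naive_gain(avg_by_year, arrival_year):
    """Change in the grade average from the year before the teacher
    arrived to the arrival year."""
    return avg_by_year[arrival_year] - avg_by_year[arrival_year - 1]

def trend_adjusted_gain(avg_by_year, arrival_year):
    """Naive gain minus the gain the school was already posting in the
    year before the teacher arrived (a crude adjustment, for illustration)."""
    prior_trend = avg_by_year[arrival_year - 1] - avg_by_year[arrival_year - 2]
    return naive_gain(avg_by_year, arrival_year) - prior_trend

# Hypothetical grade averages; the teacher arrives in 2012 in both schools.
improving_school = {2010: 60, 2011: 63, 2012: 66}  # already rising
flat_school = {2010: 60, 2011: 60, 2012: 63}       # flat before arrival

for name, scores in [("improving", improving_school), ("flat", flat_school)]:
    print(name, naive_gain(scores, 2012), trend_adjusted_gain(scores, 2012))
```

Both hypothetical schools show the same three-point rise in the arrival year. Only in the school that was flat beforehand does the rise stand out from what was already happening.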

Chetty's group responded to Rothstein's analysis, saying that they are pleased that the test scores from North Carolina show the same connections they found, even though they disagree with Rothstein on how much that relationship means.

"Everybody has the same findings in all the data sets, which is actually quite rare in economics. That's a good starting point," Chetty said, adding that another group of researchers is working on a similar paper that confirms these results.

With regard to Rothstein's criticism, they argue that what appear to be larger trends in test scores across a school are in fact due to statistical quirks and noise in the data. For example, test scores for a school's students in particular subjects can vary widely from year to year, and an unusually good year for fourth-grade math could distort the results for fifth-grade math the next year.

This dispute is just one example of the mathematical acrobatics required to isolate the effect of one teacher on their students' test scores, when so many other factors inside and outside the school's walls affect how students perform.

School districts around the country have already implemented what are known as "value-added measures" to comply with what is effectively the administration's requirement that teachers be evaluated at least partly on the basis of student achievement. Still, economists agree that some important issues remain unresolved.

Chetty said that whether value-added scores are the best way to assess teachers is still an open question. His group's paper doesn't examine alternative methods, such as observations by other faculty members.

Chetty added that basing teachers' salaries or tenure on value-added measures could have unforeseen consequences if teachers don't try to develop important qualities in their students beyond academics.

"Once you start using value-added measures in practice, their signal quality might get eroded," he said. "People might start teaching to the test." He's optimistic that teachers wouldn't make this mistake, but doesn't yet have the data to support that hunch.

Rothstein said the value-added approach can be fruitful as long as it is not misused. "I don't see myself as someone who is arguing that these models can never have any useful information in them. I don't believe that," he said. "I'm arguing that we need to go in with our eyes open."