Microsoft co-founder Bill Gates and his wife, Melinda, speak during an AP interview in Kirkland, Wash., on Feb. 1. (Ted S. Warren/AP)

A major new report concludes that a $575 million project partly underwritten by the Gates Foundation that used student test scores to evaluate teachers failed to achieve its goals of improving student achievement — as in, it didn’t work.

Put this in the “they-were-warned-but-didn’t-listen” category.

The six-year project began in 2009 when the foundation gave millions of dollars to three public school districts — Hillsborough County in Florida (the first to start the work), Memphis and Pittsburgh. The districts supplied matching funds. Four charter management organizations also were involved: Alliance College-Ready Public Schools; Aspire Public Schools; Green Dot Public Schools; and Partnerships to Uplift Communities Schools.

The Bill & Melinda Gates Foundation pumped nearly $215 million into the project while the partnering school organizations supplied their own money, for a total cost of $575 million. The aim was to create teacher evaluation systems that depended on student standardized test scores and observations by “peer evaluators.” These systems, it was conjectured, could identify the teachers who were most effective in improving student academic performance.

This, in turn, would help school leaders staff classrooms with the most effective teachers and would lead more low-income minority students to have the best teachers — or so the thinking went. Schools also agreed to boost professional development for teachers, give bonuses to educators evaluated as effective and change their recruitment process.

The 526-page report, titled “Improving Teacher Effectiveness: Final Report” and produced by the Rand Corp., says:

Overall, the initiative did not achieve its stated goals for students, particularly LIM [low-income minority] students. By the end of 2014-2015, student outcomes were not dramatically better than outcomes in similar sites that did not participate in the IP [Intensive Partnerships] initiative. Furthermore, in the sites where these analyses could be conducted, we did not find improvement in the effectiveness of newly hired teachers relative to experienced teachers; we found very few instances of improvement in the effectiveness of the teaching force overall; we found no evidence that LIM students had greater access than non-LIM students to effective teaching; and we found no increase in the retention of effective teachers, although we did find declines in the retention of ineffective teachers in most sites.

Why didn’t it work? The report’s authors couldn’t say:

Unfortunately, the evaluation cannot identify the reasons the IP initiative did not achieve its student outcome goals by 2014-2015. It is possible that the reforms are working but we failed to detect their effects because insufficient time has passed for effects to appear. It is also possible that the other schools in the same states we use for comparison purposes adopted similar reforms, limiting our ability to detect effects. However, if the findings of no effect are valid, the results might reflect a lack of successful models on which sites could draw in implementing the levers, problems in making use of teacher-evaluation measures to inform key HR decisions, the influence of state and local context, or insufficient attention to factors other than teacher quality.

The project began at a time when the newly elected Obama administration was supporting school reforms that used student test scores to evaluate teachers, despite warnings from assessment experts of big problems with doing so. Gates and Arne Duncan, who was education secretary at the time, were on the same page, believing that test scores were valid measures for high-stakes decisions.

The Obama administration, through its Race to the Top initiative, dangled federal funds in front of states that agreed to establish teacher evaluation systems using test scores to varying extents. And Gates funded his “Empowering Effective Teachers” project with the aim of finding proof that such systems could improve student achievement.

Some assessment experts were concerned from the start that the methods used to link student test scores to teacher evaluations were largely unfair and lacked statistical validity. Some educators noted that there were already effective evaluation systems for teachers that did not give weight to student test scores, including in Maryland’s Montgomery County and Virginia’s Fairfax County.

But the Gates project and Race to the Top continued, and most states adopted test-based teacher evaluation systems. In a desperate attempt to evaluate all teachers on tested subjects (reading and math), some of the systems wound up evaluating teachers on subjects they didn’t teach or on students they didn’t have. Major organizations publicly questioned these systems, including the American Statistical Association, the largest organization in the United States representing statisticians and related professionals, and the Board on Testing and Assessment of the National Research Council.

But the Gates project continued. What happened in Hillsborough County is illustrative of problems that many warned about early on. Teachers who initially supported it came to realize its weaknesses. The project required district and union leaders to work together, which happened — but not for long. In 2015, Hillsborough County gave up on it, after more than $180 million was spent there. This is what I wrote in a 2015 post:

Under the system, 40 percent of a teacher’s evaluation would be based on student standardized test scores and the rest by observation from “peer evaluators.” It turned out that costs to maintain the program unexpectedly rose, forcing the district to spend millions of dollars more than it expected to spend. Furthermore, initial support among teachers waned, with teachers saying that they don’t think it accurately evaluated their effectiveness and that they could be too easily fired.

Now the new superintendent of schools in Hillsborough, Jeff Eakins, said in a missive sent to the evaluators and mentors that he is moving to a different evaluation system, according to this article in the Tampa Bay Times. It says:

Unlike the complex system of evaluations and teacher encouragement that cost more than $100 million to develop and would have cost an estimated $52 million a year to sustain, Hillsborough will likely move to a structure that has the strongest teachers helping others at their schools.

Eakins said he envisions a new program featuring less judgmental “non-evaluative feedback” from colleagues and more “job-embedded professional development,” which is training undertaken in the classroom during the teacher work day rather than in special sessions requiring time away from school. He said in his letter that these elements were supported by “the latest research.”

From the start, critics had warned about using a standardized test designed for one purpose to evaluate something else — a practice frowned upon in the assessment world. The Rand report affirmed those concerns and said problems with using test scores as a metric were significant:

Teacher evaluation was at the core of the initiative, and the sites were committed to using the measures to inform key HR decisions. But, as we described in Chapters Three through Eight, the sites encountered two problems related to these intended uses of the TE measures. First, it was difficult for the sites to navigate the underlying tension between using evaluation information for professional improvement and using it for high-stakes decisions. Second, some sites encountered unexpected resistance when they tried to use effectiveness scores for high-stakes personnel decisions; this occurred despite the fact that the main stakeholder groups had given their support to the initiative in general terms at the outset.

The findings revive questions about whether the country is well-served when America’s wealthiest citizens choose pet projects and fund them so generously that public institutions, policy and money follow — even if those projects are not grounded in sound research. Such concerns have been raised most often about Gates, because he is the largest education philanthropist by far, and because he was a key player in Obama administration education reforms.

Gates, though, was pushing his own ideas for school reform before Obama became president, and he has since acknowledged that none of them turned out as well as he had hoped. In 2014, he gave a nearly hour-long interview at Harvard University, saying, “It would be great if our education stuff worked, but that we won’t know for probably a decade.”

In 2000, his foundation began investing in education reform with an expensive effort to turn big dropout high schools into smaller schools, which he abandoned, writing in his foundation’s 2009 annual letter that the results had been unimpressive. Instead, he said he would focus on teacher effectiveness and the dissemination of best teaching practices. He spent hundreds of millions of dollars to help create and implement the Common Core State Standards, which became highly controversial.

Now, Rand has declared his massive teacher effectiveness project to have fallen short of its goals. The Rand report does say that “the initiative did produce benefits, and the findings suggest some valuable lessons for districts and policymakers.” What lessons? Well, the report’s authors say some teachers reported learning how to improve from the observations. They also said the project had succeeded in helping schools “measure effectiveness” but not in showing them how to “increase it.” Of course, that is a loaded finding, given that there are many definitions of “effectiveness.”

Some school reformers are reluctant to say the project was a waste of time and money. They say the project taught us what doesn’t work. That ignores the fact that some education experts warned from the start that some of the premises on which it rested were not sound.

The bottom line: School reformers, led by Gates and supported by Duncan, felt the need to spend $575 million to prove their critics right.