Here is the text of a speech (as prepared for delivery) that Education Secretary Arne Duncan gave this week to the American Educational Research Association, meeting in San Francisco.
Duncan addresses growing criticism of high-stakes standardized tests, among other subjects.
Here’s what he said, according to the Education Department’s website:
I’m going to talk at some length today about some of the current controversies over standardized testing and the purposes of assessment. But I want to preface those remarks by saying how much I value compelling education research. The truth is that today educators and policymakers still have a large unmet need for relevant research.
Education researchers can and do play an invaluable role in formulating policy, from preschool to grad school. You are the experts. You are the independent truth-tellers.
Yet the role of the independent expert is not at odds with asking hard questions about the practical implications of your work and assisting practitioners to improve education outcomes. Rigor is necessary but not sufficient. Relevance matters.
In my seven-plus years as CEO of the Chicago Public Schools, and in my four years in this job, I’ve often had occasion to ask basic questions about program effectiveness to which there were few compelling answers.
Today, for example, federal, state, and local governments spend several billion dollars a year on professional development for teachers. Yet we know surprisingly little about the effectiveness and return on investment of professional development. Teachers deserve better, as do taxpayers. Why do we continue to spend so much every year, without a clear sense of efficacy?
Samuel Johnson famously said that “to count is modern practice, the ancient method was to guess.” Sadly, school leaders and educators too often have to guess when they make education policy.
To make education research more relevant, I would ask you to consider two challenges.
The first is to do a more complete job of asking comparative questions in research and evaluation.
Even in cases where methodologies are imperfect and the evidence is complex, policymakers need the help of researchers to deal with real-world challenges—like designing a new and better generation of assessments, or figuring out what works in expanding high-quality preschool programs.
We need you, the researchers, to answer the question “Which approach works better—this one or that one”—and then we need to move forward informed by your answer.
We need you to assess how reforms compare to the status quo—and we can’t let the perfect become the enemy of the good.
My second challenge would be to remain open to findings that contradict or compel a rethinking of the conventional wisdom.
There’s a long history of robust skepticism in program evaluation and social innovation—so much so that it has even been codified in informal “laws” of evaluation.
We have all heard of the Law of Unintended Consequences. Researchers are also familiar with Campbell’s Law, which is often cited in debates over standardized testing and accountability.
It holds that “the more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures—and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”
I’ll talk about Campbell’s Law with regard to testing and accountability in a minute. But let me point out that unintended consequences of social policy are often welcome in the field of education.
Perhaps the biggest social program and expansion of educational opportunity in higher education in the 20th century was the GI Bill, which FDR signed into law in 1944.
But none of the GI Bill’s authors foresaw that the education and training provided in the GI Bill would fundamentally transform not only the lives of the Greatest Generation but community after community in America.
Instead, Congress enacted the original GI Bill to avoid a rerun of the chaos that ensued after World War I, when the return of millions of vets expanded the jobless rolls and prompted both labor strikes and race riots.
In fact, a Mississippi congressman named John Rankin, who led the fight for the GI Bill, was a bigot and an ardent defender of segregation. In an irony I love, he totally failed to foresee that he helped launch the civil rights revolution by enabling hundreds of thousands of black veterans to finish their schooling and buy homes.
Long after the GI Bill, what I’ll call the Law of Welcome Surprises continued to transform education and propel innovation.
Consider one more example: In the late 1960s, a group of Pentagon staffers designed an obscure computer network to enhance communication among defense department projects at research labs and universities and to withstand attack. They failed to anticipate the network’s potential to facilitate research and E-mail. Today, as you know, that computer system has mushroomed into the world-wide Internet.
I would suggest to you that the first term of the Obama administration was filled with important educational advances that few if any experts anticipated.
Forty-six states, plus the District of Columbia, voluntarily came together to design and adopt the Common Core standards—even though shared, internationally-benchmarked standards were supposed to be the third rail of education policy.
With federal support, 44 states plus DC are part of two large state consortia that are designing a new generation of assessments to better measure the higher-order thinking skills so vital to success in a knowledge-based, global economy.
A sea-change is underway in the state of assessment in the U.S. that few predicted in 2009. As Linda Darling-Hammond noted recently, “The question for policymakers has shifted from, ‘Can we afford assessments of deeper learning?’ to, ‘Can the United States afford not to have such high-quality assessments?’”
In the time I have left, I’d like to discuss the challenges I’ve highlighted about asking hard comparative questions and heeding those counterintuitive outcomes, but with special attention to standardized testing and assessment.
I think we can generally agree that standardized tests don’t have a good reputation today—and that some of the criticism is merited. Policymakers and researchers have to listen very carefully—and take very seriously the concerns of educators, parents, and students about assessment.
At its heart, the argument of the most zealous anti-testing advocates boils down to an argument for abandoning assessment with consequences for students, teachers, or schools.
The critics contend that today’s tests fail to measure students’ abilities to analyze and apply knowledge, that they narrow the curriculum, and that they create too many perverse incentives to cheat or teach to the test. These critics want students and teachers to opt out of all high-stakes testing.
The critics make a number of good points—and they express a lot of the frustration that many teachers feel about today’s standardized tests.
State assessments in mathematics and English often fail to capture the full spectrum of what students know and can do. Students, parents, and educators know there is much more to a sound education than picking the right answer on a multiple choice question.
Many current state assessments tend to focus on easy-to-measure concepts and fill-in-the-bubble answers. Results come back months later, usually after the end of the school year, when their instructional usefulness has expired.
And today’s assessments certainly don’t measure the qualities of great teaching that we know make a difference: things like classroom management, teamwork, collaboration, and individualized instruction. They don’t measure the invaluable ability to inspire a love of learning.
Most of the assessment done in schools today is after the fact. Some schools have an almost obsessive culture around testing, and that hurts their most vulnerable learners and narrows the curriculum. It’s heartbreaking to hear a child identify himself as “below basic” or “I’m a one out of four.”
Not enough is being done at scale to assess students’ thinking as they learn, to boost and enrich learning, and to track student growth. Not enough is being done to use high-quality formative assessments to inform instruction in the classroom on a daily basis.
Too often, teachers have been on their own to pull these tools together—and we’ve seen in the data that the quality of formative tools has been all over the place.
Schools today give lots of tests, sometimes too many. It’s a serious problem if students’ formative experiences and precious time are spent on assessments that aren’t supporting their journey to authentic college- and career-readiness.
Here is an example from my own experience. When I started at CPS in Chicago in 2001, each spring 100,000 students in grades 3 through 8 were required to take the Iowa Tests of Basic Skills. And at the same time, those same students were also required to take the Illinois state assessment.
The Iowa Tests of Basic Skills were not aligned to the Illinois state standards. And none of our students actually lived in Iowa.
So, after my second year on the job, I decided CPS would drop the Iowa Tests. That effectively eliminated 50 percent of the summative tests that our students had to take.
In short, I agree with much of the critique of today’s tests.
Now, the essential question is where do we go from here?
Despite the flaws of today’s tests, we can’t throw the baby out with the bathwater. I don’t believe that the problems of assessing student growth are so unsolvable that we should take a pass on measuring growth—or bar the consideration of student progress in learning from teacher evaluation.
Standardized assessments are still a needed tool for transparency and accountability across the entire education system. We should never, ever return to the days of concealing achievement gaps with school averages, no-stakes tests, and low standards.
The fact is that no one is more damaged by weak accountability measures than our most vulnerable students.
We must reliably measure student learning, growth, and gain.
The solution to mediocre tests is not to abandon assessment. Instead we are supporting the creation of much better assessments, aligned with higher standards, to propel better instruction and assess growth in learning. That is a driving motivation for our Department’s $350 million Race to the Top Assessment awards to two consortia to develop Assessment 2.0.
In recent months, anti-testing advocates have claimed that high-stakes tests drive teachers and school administrators to cheat. But that argument confuses correlation with causation. And it also ignores history. It ignores the compared-to-what question.
There is no excuse for school administrators and teachers tampering with student tests to boost test scores. It is morally indefensible—and it is most damaging to the very students who most desperately need the help of their teachers and school leaders.
But I reject the idea that the system makes people cheat. Millions of educators administer tests, but very few choose to cheat. In all but a tiny minority of cases, teachers want their children to genuinely learn and grow, not achieve phony gains to make themselves or their schools look good.
And if a district’s culture is rotten at the core, people must speak out.
As rare and as disturbing as cheating is, it’s not without a history. I don’t think any leader should tolerate cheating—and the public should not tolerate any leader who does.
In Chicago, CPS first confronted the issue of teacher cheating in 1984, when I was in college—nearly two decades before the passage of No Child Left Behind. News reports of cheating appeared in local newspapers—and a small number of schools made unusually large gains on tests or had ordered unusually large numbers of answer sheets.
So in 1985 CPS ordered a retest at 23 suspect schools and 17 comparison schools. Not surprisingly, the suspect schools did much worse on the audited retest than the comparison schools.
As Steven Levitt detailed in his book Freakonomics, he and Brian Jacob had developed a rigorous algorithm to detect cheating by teachers in the Chicago Public Schools by the time I became CEO in 2001.
Jacob and Levitt had assembled a massive body of test scores from 1993 to 2000 in grades three through seven, with over 700,000 student-year observations. Their groundbreaking analysis found the possibility of teacher or administrator cheating on standardized tests in three to five percent of elementary school classrooms each year in CPS.
I didn’t want their research to sit on a shelf. So in 2002, I asked Professors Jacob and Levitt to work with CPS to implement a retest and audit in 120 classrooms—a little more than 70 of which were classrooms where cheating was suspected.
Sure enough, in the classes with the highest odds of cheating, the retested students lost more than a full grade in their reading scores on the retest. By cheating, those educators were lying to their students about their readiness to succeed. After further investigation confirmed that a number of teachers had cheated, we fired about a dozen teachers.
Thankfully, Jacob and Levitt’s algorithm provided an unexpected benefit as well—one of those welcome surprises. The algorithm could also identify many of the best teachers in our school system.
Instead of getting random answers correct, the students of good teachers showed real improvement in reading and math on the types of questions they had previously missed, an indication of actual learning. And a good teacher’s students carried over all their gains into the next grade.
In the end, professors Jacob and Levitt concluded that teacher cheating “is not likely to be a serious enough problem by itself to call into question high-stakes testing, both because it is relatively rare . . . and [is] likely to become much less prevalent with the introduction of proper safeguards.”
That’s also my view, and it happens to be consistent with the view of Donald Campbell, the author of Campbell’s Law. Campbell urged policymakers to develop ways of avoiding the potentially corrupting influence of quantitative indicators in decision-making, and he suggested, and I quote, that “the use of multiple indicators, all recognized as imperfect, will alleviate the problem.”
I have said repeatedly and consistently that teacher evaluation should never, ever be based only on test scores. Just as Campbell urged, it should always include multiple, albeit imperfect measures, like principal observation or peer review, performance-based assessments, student work, student surveys, and parent feedback.
I’m not just giving lip service to using multiple measures for accountability. I’ve always been convinced it is the best way to go.
All 35 states we have approved for waivers to the Elementary and Secondary Education Act are required to use multiple measures to evaluate teachers, and 33 of the states are including individual student growth.
States with waivers are also including multiple indicators for school accountability. Twenty-seven states are using their flexibility to include measures that go far beyond the reading, math, and graduation rates required under No Child Left Behind in their accountability systems.
They are looking at participation and performance in advanced coursework. They are taking account of performance in science and other subjects. And they are including measures of college- and career-readiness. Is it more complicated? Of course. But more sophisticated and comprehensive? Absolutely!
In the future, I hope that our accountability systems continue to make better and richer use of multiple indicators.
Student growth, attendance rates, graduation rates and matriculation to college, college persistence, school safety, narrowing achievement gaps—those are just some of the measures that should factor in school performance.
At the same time, I fervently hope that public education never reverts to an era of low standards and low-quality assessments with no accountability. It’s amazing how far we have come—and it makes no sense to go backwards.
That’s a view that most educators share. Al Shanker, for example—the legendary labor leader—was an unabashed supporter of both high-quality and high-stakes assessment.
Shanker believed—and I quote—that “world class standards and curriculum, and assessments based on these standards, will be no more successful than any other reform unless stakes are attached to the assessments.” “Stakes change everything,” Shanker said. “Stakes for kids go right to the heart of what motivates them to work and learn.”
Now, contrary to the myth that one sometimes hears today, the vast majority of high-performing nations have both demanding and very high-stakes assessments.
In fact, gateway exams for postsecondary education in most high-performing countries, and sometimes even for secondary education, often determine whether students will be tracked into vocational and technical training or a baccalaureate track.
The U.S. should never adopt the practice of high performers who use high-stakes tests to track students. I absolutely reject that mindset. But we can learn a great deal about how to do assessment from our high-performing competitors.
Whether it is Singapore’s PSLE and GCE assessments, China’s GaoKao college entrance exam, the French “bac,” South Korea’s CSAT, Germany’s Abitur, or the British A-levels, assessments linked to high standards propel good instruction and higher-order learning around the world.
In virtually all of these high-flying systems, teachers and students spend lots of time preparing and studying for these gateway assessments. In fact, rigorous assessments actually take more time to complete than today’s bubble tests, many of which just measure basic skills.
Yet test preparation for assessments in these nations is not so much time out from learning but rather part of the learning process itself. It provides valuable learning opportunities and feedback for instruction.
High-performing countries tend to have assessments that are worth teaching to—and that is a core aim of the Race to the Top Assessment competition.
The next generation of assessment systems includes diagnostic or formative assessments, not just end-of-the-year summative assessments. The two state consortia must assess student achievement of standards, student growth, and whether students are on track to being college and career-ready. And the new assessment systems must be effective, valid, and instructionally useful.
As I listen and meet with teachers across the country, I never hear them say that they want to get rid of assessments—or give up on assessing student growth in their classrooms.
In fact, the overwhelming majority of teachers hunger for good assessments that ask students to demonstrate what they have learned—whether it is writing a persuasive essay, solving complex problems, or working collaboratively.
The new assessments from the consortia will be a vast improvement on assessment as it is done today.
The PARCC consortium, for example, will evaluate students’ ability to read complex texts, complete research projects, excel at classroom speaking and listening assignments, and work with digital media.
The Smarter Balanced consortium will assess students using computer adaptive technology that will ask students questions pitched to their skill level, based on their previous answers. And a series of optional interim evaluations during the school year will inform students, parents, and teachers about whether students are on track.
The use of smarter technology in assessments will also change instruction in ways that teachers welcome.
Technology makes it possible to assess students by asking them to design products or experiments, to manipulate parameters, run tests, and record data. Problems can be situated in real-world environments where students perform tasks, or can include multi-stage scenarios and extended essays.
I have no doubt that Assessment 2.0 will help educators drive the development of a richer curriculum at the state, district, and local level, differentiated instruction tailored to individual student needs, and multiple opportunities during the school year to assess student learning.
As I have said before, I believe this new generation of assessments—combined with the adoption of internationally-benchmarked, college and career-ready standards—is an absolute game-changer for American education.
But I do not suggest for a minute that the advent of college-ready standards and assessments will bring us to some educational nirvana a few short years from now.
As important as better assessments are, they are not a pedagogical silver bullet. Standards and assessments are only the foundation upon which states and districts will construct high-quality curriculum, meaningful, job-embedded professional development, and all the other pieces that will support teachers preparing to teach to these new standards and students learning at higher levels.
When the two consortia roll out their new assessments in the 2014-15 school year, they will be a work in progress. I’m sure not everything will go according to schedule. There will be glitches. There will be mistakes. But we cannot let the perfect become the enemy of the good.
Assessment 2.0 will need lots of work to get to version 2.1 and 2.2. I expect that states and districts will improve implementation as they learn from pilots and field tests. And teachers will play an absolutely critical role in telling us what works and what doesn’t work.
We are asking an enormous amount of principals and teachers in the next several years. In a relatively short transition, teachers are being asked to teach to much higher standards, to help develop and implement more effective curriculum aligned with those standards, and to prepare all students for more demanding assessments.
It is vital that we provide teachers with the resources and professional development they need to make the transition to college and career-ready standards. They want to teach to these higher standards—they just need support getting there. And it is vital that teachers help shape and re-shape this transformation to world-class standards and assessments.
Despite these challenges, I have great faith that teachers can help to innovate and design solutions in the classroom. Why do I have such confidence? Because they are already doing it.
Most teachers today don’t teach in tested subjects. So many teachers rightly want to know, how can student achievement be a factor in teacher evaluation in non-tested subjects?
It’s a great question—and frankly, some states and districts have not developed smart solutions for evaluating teachers in non-tested grades and subjects.
In one Florida district, the school board, teachers union, and management all agreed that 40 percent of each teacher’s evaluation at a K-2 school would be based on the test scores of fourth- and fifth-graders at a different elementary school.
Clearly that needs to be rethought. Evaluations should be used primarily to reshape professional learning.
Yet for every goofy, jerry-rigged solution to these issues, teachers have come up with innovative and rigorous ways to assess teacher performance in non-tested subjects.
In Memphis, arts teachers were understandably frustrated because they were being evaluated based solely on school-wide performance in math and English.
So Dru Davison, a fantastic music teacher and arts administrator, convened a group of arts educators to come up with a better evaluation system.
After Dru’s committee surveyed arts teachers in Memphis, they decided to develop a blind peer review evaluation to assess portfolios of student learning. It has proved enormously popular—so much so that the Tennessee State Board of Education evaluated and approved the student growth portfolio system for use elsewhere in the state.
Three districts, with 500 teachers, have already signed on—and many more districts are expected to adopt the Memphis fine arts evaluation system this summer.
This is how we will constantly improve: Problem-solve, rather than complain; listen, rather than react defensively; and commit to action when a great idea is on the table.
In conclusion, I think policymakers, school leaders, educators, and researchers must remain open and committed to dramatically improving assessment.
And we must also remain open to what our best research shows about high-quality assessment—even when the results are unexpected.
In the long run, I believe that Assessment 3.0 will include assessments that do even more to personalize learning, and will accelerate the shift from seat-based learning to competency-based learning.
And in another one of those welcome surprises, the same fiber optic cables that will make it possible for every school to participate in a new generation of assessments in the years ahead—the sooner the better—will also empower educators to make much better and personalized use of assessment in future decades.
Here is a finding from assessment research that may make some students and teachers groan.
Research in the learning sciences, including IES-funded projects, has shown across a range of different content areas that so-called “retrieval testing”—frequent low-stakes quizzing—was actually more effective at boosting student learning in most cases than additional studying.
And finally, testing experts need to further expand the range of assessments in the years ahead by developing better, reliable, and valid assessments of children’s non-cognitive skills. This is the next frontier in assessment research—and it is hugely important to me.
We know from Paul Tough’s outstanding recent book, a multitude of studies, and James Heckman’s analysis of the Perry Preschool Project, that the development of skills like grit, resilience, and self-regulation early in life is essential to success later in life.
I would love to see assessment experts work with schools and districts to develop more reliable, meaningful, and easy-to-administer assessments that help us understand whether we are teaching the non-cognitive skills that predict students’ success in college, careers, and life.
When I ran an “I Have a Dream” program back in Chicago for six years in the 1990s, we spent a lot of time and energy trying to help our children gain these skills. But today, I still cannot honestly tell you whether we were successful in our efforts or not.
IES is currently funding a project involving 535 children, in 58 preschool classrooms in Tennessee, to develop a teacher rating scale and a direct assessment of children’s learning related to self-regulation skills. It’s a great start—but we still have a long way to go in assessing these so-called “soft skills” that are actually anything but soft.
Ultimately, a great education involves much more than teaching children simply to read, write, add, and subtract.
It includes teaching them to think and write clearly, and to solve problems and work in teams. It includes teaching children to set goals and to persist in tasks, and helping them navigate the world.
Education comes from the Latin word educere—which means “to lead forth.”
A new generation of high-quality assessments must be a cornerstone of America regaining its educational leadership. And researchers, with rigor and relevance, must help lead forth that effort.
The truth is, all of us must “lead forth” on education. Our education system today stands on the cusp of a great transformation to higher standards, world-class assessments, and more engaging, meaningful curriculum.
With your expert analysis, with your commitment to working with teachers and practitioners, please help us make that transformation a success for all children.