Perhaps the key struggle between Democrats and Republicans over the rewriting of No Child Left Behind will be about annual standardized testing.
Sen. Lamar Alexander, the Republican from Tennessee who just became chairman of the Senate education committee, says he is determined to get a bill rewriting No Child Left Behind to the Senate floor by the end of February. That would be something of a feat, given that NCLB was signed into law in 2002 and was supposed to be rewritten in 2007. Congress has been unable to get the job done since then.
In new draft legislation, Alexander is offering two possibilities regarding annual standardized testing: maintaining the NCLB federal mandate of standardized testing in grades 3 to 8 and once in high school, or allowing school districts to decide.
The Obama administration — and Sen. Patty Murray, the key Democrat on the Senate education committee — want the federal mandate to remain.
In the following post, Bruce D. Baker, a professor in the Graduate School of Education at Rutgers, the state university of New Jersey, looks at the testing debate and attempts to separate what makes sense from what doesn’t. With Baker’s permission, here is a version of the original, which appeared on his blog School Finance 101.
By Bruce D. Baker
It continues to blow my mind that many engaged on the pro-annual testing side of the debate see the annual testing of all children in all grades as the one and only method of achieving all of the things testing, in their view, is intended to achieve, including:
- school and local education agency accountability (e.g. imposing “death penalties” on those “failure factories”!)
- individual student accountability (e.g. making sure that the kid who missed one additional question on the state test doesn’t graduate)
- teacher accountability (e.g. firing those teachers who don’t show year over year gains on test scores, as estimated with value-added models)
- school level “data driven” leadership (e.g. leveraging “cage busting” leadership to achieve the pinnacle of awesomeness)
The presumption is that a single method of testing – testing every child, every year, in every subject – is the appropriate method, indeed the only method, to accomplish all of these tasks simultaneously. We can’t possibly make sure no child is left behind if we don’t test them all every year. And we can’t possibly point the finger of blame for a child being left behind if we don’t test them all every year, and link those testing data to their teachers and schools!
The presumption goes further: We can’t possibly ensure that all children are “college ready” unless we can show that each and every one of them receives, near the end of high school, a score on a common assessment (PARCC or Smarter Balanced, the two new Common Core tests developed by multi-state consortia with federal funds) that is reasonably predictive of a combined score of 1,550 or higher on the SAT! And, the thinking apparently goes, this entire system must be built on a set of common national standards if we are ever to make valid comparisons of the quality of schooling from Tennessee to Massachusetts, or the effectiveness of individual teachers from the Bayou to Battle Creek.
The counterargument at this point seems to favor the complete abandonment of yearly assessment and common standards altogether, reverting to a hodgepodge of state and local curricula, standards and assessments.
Missed in most of the conversation are the valid, relevant uses of student assessment, and the different purposes of, and approaches to, testing, measurement, and large- and small-scale assessment in our schooling system.
Mixed in with this discussion of late is whether annual testing enhances the civil rights of children, or erodes them.
Here’s a quick run-down on a) the purposes of testing in schools, b) how to implement testing to best address those purposes, c) the right and wrong uses of testing with respect to civil rights concerns, and d) the role of common standards in all of this.
Purposes of Testing (measuring student achievement) in our public schools
While there are potentially many more purposes of assessment in school settings, I boil it down here to:
- testing for diagnostic and instructional purposes
- testing for system monitoring purposes (e.g. accountability)
These two major purposes of testing are best achieved by very different approaches to and uses of testing.
Testing for diagnostic & instructional purposes (Individual)
When it comes to diagnostic testing, for enhancing the instruction of individual children and groups of children – the dynamic teacher/student interaction – we want to implement that testing in a way that allows children to move at their own pace, gives them immediate feedback, and provides teachers with timely, relevant information on what kids know, what they don’t know, what they’re struggling with, and so on. This is fine-grained information, speaking to specific knowledge and skills children are developing, on a day-to-day basis (not from April of one year to April of the next, with feedback the following October).
The logical implementation approach here, given the technologies of testing today, is to have kids engage in assessments along the way, through computer adaptive testing, asynchronously. Not all the kids in a big room of computers working through the same item bank on a given day, but kids progressing through relevant, timely computer adaptive assessments (a few minutes here and there) that provide immediate diagnostic feedback to teachers. Plenty of schools already do this kind of thing, whether effectively or not.
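The adaptive mechanics described here can be sketched in a few lines. This is a hypothetical illustration, not any vendor’s implementation: it assumes a toy item bank of difficulty values and a crude fixed-step ability update (real adaptive tests use item response models with maximum-likelihood or Bayesian estimation), but it shows the core loop of picking each next question near the student’s current estimated level.

```python
def next_item(item_bank, ability, asked):
    """Pick the unasked item whose difficulty is closest to the current
    ability estimate -- roughly where the item is most informative."""
    candidates = [i for i in range(len(item_bank)) if i not in asked]
    return min(candidates, key=lambda i: abs(item_bank[i] - ability))

def adaptive_test(item_bank, answer_fn, n_items=5, step=0.5):
    """Run a short adaptive test: after each response, nudge the ability
    estimate up (correct) or down (incorrect), then choose the next item
    near the new estimate. `answer_fn(difficulty)` stands in for the
    student's response to an item of that difficulty."""
    ability = 0.0
    asked = set()
    for _ in range(n_items):
        i = next_item(item_bank, ability, asked)
        asked.add(i)
        ability += step if answer_fn(item_bank[i]) else -step
    return ability
```

A student who keeps answering correctly is quickly routed to harder items and a higher estimate; a struggling student is routed downward, which is what makes the feedback immediate and individual.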
To be clear – I’M NOT TALKING ABOUT THIS BEING THE PRIMARY INSTRUCTIONAL MODEL ITSELF! – OR DOMINANT DAY-TO-DAY CLASSROOM ACTIVITY. I’m talking about this being an available tool, used appropriately to help teachers figure out what kids are getting and what they are not (recognizing that teachers have many other tools at their disposal … like actually asking questions and listening to kids).
This information should NOT be used for “accountability” purposes. It should NOT be mined/aggregated/modeled to determine at a high level whether institutions or individuals are “doing their jobs,” or for closing schools and firing teachers. That’s not to say, however, that there might not be some use for institutions (school districts) mining these data to determine how students are progressing on certain concepts/skills across schools, in order to identify strengths and weaknesses. In other words, for thoughtful data-informed management. Current annual assessments aren’t particularly useful for “data-informed” leadership either. But this stuff could be, given the right modeling tools.
This is the approach we use to ensure that no child is left behind. By the time annual, uniform, standardized assessment data are returned in relatively meaningless aggregate scores to the front office six months down the road, those kids have already been left behind, and the information provided isn’t even fine-grained enough to help them catch up.
Testing for accountability/System Monitoring (Institutional)
When it comes to testing for system monitoring, where we are looking at institutions and systems rather than individuals, immediate feedback is less important. Time intervals can be longer, because institutional change occurs over the long haul, not just from this year to the next. Further, we want our sampling – our measurements – to be as minimally intrusive as possible, both in terms of how often we take those measurements and in terms of how many measurements we take at any one time. In part, we want measurement for accountability purposes to be non-intrusive so that teachers, local administrators, and especially the kids can get on with their day – with their learning and the development of knowledge and skills.
So, when it comes to “system monitoring,” the most appropriate approach is to use a sampling scheme that is minimally sufficient to capture, at a point in time, the achievement levels of kids in any given school or district (institution). You don’t have to test every kid in a school to know how kids in that school are doing. You don’t have to have any one kid take an entire test, if you creatively distribute relevant test items across appropriately sampled kids. Using sampling methods like those used in the National Assessment of Educational Progress can go a long way toward reducing the intrusiveness of testing while providing potentially more valid estimates of institutional performance (how well schools and districts are doing).
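The sampling idea, testing only a subset of students, each on only a slice of the item pool, can be sketched as follows. This is a deliberately simplified, hypothetical illustration: the student model, pool sizes, and sampling rates are made up, and real matrix-sampling designs like NAEP’s use balanced item blocks and survey weights rather than simple random draws.

```python
import random

def matrix_sample_estimate(students, item_pool, sample_size,
                           items_per_student, rng):
    """Matrix-sampling sketch: test a random subset of students, each on
    a small random slice of the item pool, then pool every response into
    one school-level estimate (fraction of items answered correctly).

    `students` maps a student ID to a callable(item) -> bool that stands
    in for whether that student answers the item correctly."""
    sampled = rng.sample(sorted(students), sample_size)
    correct = total = 0
    for student_id in sampled:
        answers = students[student_id]
        for item in rng.sample(item_pool, items_per_student):
            correct += answers(item)
            total += 1
    return correct / total
```

The point is that no individual child sits through the whole test, and most children sit through nothing at all, yet the pooled responses still estimate how the school as a whole is doing.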
If we want to know the physical health of a school’s student population, we don’t make them walk around all day with thermometers hangin’ out (or perhaps these days, with a temporal scan duct-taped to their heads). Rather, we might appropriately sample, in time, and across children.
This testing process could be done annually, to result in annual reports on school performance. These annually collected data, if sampled appropriately (using relevant statistical imputation methods), could also be used to estimate gains achieved by children attending specific schools. I would assert that even annual universe data (all kids tested every year) are of minimal value for assigning useful, reliable, or valid “effect” measures to individual teachers.
Here’s the really important part, which also relates to my thermometer example above. The testing measures themselves ARE NOT THE ACTIONABLE INFORMATION. Testing provides information on symptoms, not causes or underlying processes. It is pure folly to look at low test scores for a given institution, and follow up with an action plan to “improve test scores,” or close the school if/when test scores don’t improve, without ever taking stock of the potential causes behind the low test scores. TEST SCORES ARE SYMPTOMS, NOT CAUSES, NOT ACTIONABLE IN AND OF THEMSELVES.
Where testing for system monitoring purposes reveals gaps between groups of students, or low performance in specific sets of schools, our first course of action should be to dig into the underlying processes and inputs. Do these low-performing schools have equitable resources to meet their children’s needs? If we find that they don’t – that these lower-performing schools serve far more children with greater educational needs, have burgeoning class sizes, and offer non-competitive teacher compensation – then we’ve got something actionable: resource disparities to address, at least as a first course of action.
Further, testing data of this type, or of the diagnostic type, are ALWAYS UNCERTAIN – that is, the difference between the 49th and 51st percentile may not be a difference at all. So we shouldn’t call it one! We shouldn’t draw lines in that sand, or apply bold, disruptive consequences to distinctions that may in fact be statistically meaningless!
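This uncertainty point can be made concrete with a back-of-the-envelope check. The sketch below uses a normal-approximation 95% confidence interval around each school’s mean score (the scores and sample sizes are invented; real analyses would use more careful methods, but the overlap logic is the same): if two schools’ intervals overlap, the ranking between them may be nothing but noise.

```python
import math
import statistics

def mean_ci(scores, z=1.96):
    """95% confidence interval for a school's mean score, using the
    normal approximation: mean +/- z * (stdev / sqrt(n))."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return (m - z * se, m + z * se)

def distinguishable(scores_a, scores_b):
    """Crude check: True only if the two schools' 95% intervals do not
    overlap. Overlapping intervals mean the apparent difference may be
    statistically meaningless."""
    lo_a, hi_a = mean_ci(scores_a)
    lo_b, hi_b = mean_ci(scores_b)
    return hi_a < lo_b or hi_b < lo_a
```

Two schools whose samples average 49 and 51 will typically have heavily overlapping intervals, which is exactly why consequences shouldn’t ride on a two-point gap.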
Testing as a Civil Rights Issue?
How does all of this relate to the recent discussion of whether the presence of annual testing enhances or erodes children’s civil rights, particularly those of disadvantaged minority groups? Well, it all depends on how that testing is used. Used correctly, implemented appropriately, testing for system monitoring purposes is vital to the protection of civil rights. Used inappropriately, as has often been the case, testing can violate children’s civil rights.
As someone who engages in expert witness work evaluating the equity and adequacy of state education systems, I find testing information useful in exploring disparities in children’s outcomes that may raise civil rights concerns.
But again, as noted above, the key here is to recognize that testing outcomes are potential indicators of input or opportunity disparities. Testing outcomes themselves are NOT the disparities of interest to which policy leverage can be directly applied. That’s just dumb. One does not fix achievement gaps by setting the goal “fix that achievement gap!”
That said, without testing, we might no longer have available reliable and valid evidence that those gaps persist.
There are certainly cases where the common misuses of testing raise serious civil rights concerns. For example:
- Applying strict cut scores at the individual level (high stakes exams) that sort and exclude children disproportionately by race and income, while never addressing input/opportunity disparities that might be the cause of disparate outcomes.
- Applying strict cut scores at the institutional level to lay blame on teachers and their institutions for the disproportionate failure of low income and minority children, while never addressing input/opportunity disparities that might be the cause of disparate outcomes.
Sadly, I’d say that these two abuses of testing data are far more common than the appropriate uses I outline above. For the past decade and a half, and escalating in recent years, we have made policy determinations on test scores alone – taking action on test scores alone, never using those test scores to explore underlying causes – and in the process we have disproportionately limited the high school graduation and college matriculation options of poor and minority children, and have disproportionately closed the schools serving them based on symptoms, not causes.
The bad has far outweighed the good in existing policy uses of testing data!
On Common Standards
Finally, about those common standards. For me, the greatest potential virtue of common standards across states, accompanied by a minimally intrusive system for assessing those standards (as addressed above), is that we might finally get a better handle on the relative adequacy of resources available to children across states. We might then be able to put some pressure on those states that have arguably thrown their entire public school systems under the bus to invest sufficiently to achieve those standards. For years, for example, Tennessee has spent next to nothing on its public schools and set outcome standards low enough that everything still appeared just fine (unless, of course, you look at NAEP instead of pass rates on the state’s own tests). Yes, this is hugely wishful thinking!
But, without common standards, we can’t even begin to measure the costs of achieving those common standards across settings.