Putting Assessments to the Test

By Valerie Strauss
Washington Post Staff Writer
Monday, March 26, 2007

One in an occasional series looking at the culture of testing.

No Child Left Behind, President Bush's signature education law, requires that millions of students across the country be tested annually and that the tests produce "reliable and valid" data to measure how well they -- and their schools -- are doing.

Testing experts say that one part of that equation is fairly easy to do, but the other . . . not so much.

Reliability essentially means that a test is, well, reliable; perfect reliability would mean that a student performs the same way on a test every time it is given. Things get in the way -- including the health or frame of mind of the test-taker, the sampling of content on the test and scoring errors -- but it is possible to quantify those mistakes and put error bands around a score that say how much it might vary.
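
One common way to put such error bands around a score comes from classical test theory: the standard error of measurement equals the spread of scores times the square root of one minus the test's reliability. The sketch below is illustrative only. The article names no formula, and the function names, the 1.96 multiplier (a roughly 95 percent band) and the sample figures are assumptions.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


def error_band(observed: float, score_sd: float, reliability: float,
               z: float = 1.96) -> tuple[float, float]:
    """Return (low, high) bounds around an observed score.

    z = 1.96 is an assumed multiplier giving a roughly 95 percent band.
    """
    sem = standard_error_of_measurement(score_sd, reliability)
    return (observed - z * sem, observed + z * sem)


if __name__ == "__main__":
    # Hypothetical test: scores spread with SD 100, reliability 0.91.
    low, high = error_band(observed=500, score_sd=100, reliability=0.91)
    print(f"score 500, band {low:.0f} to {high:.0f}")
```

With these made-up figures the standard error is about 30 points, so a reported 500 could plausibly reflect a true score anywhere from about 441 to 559. A perfectly reliable test (reliability of 1) would shrink the band to nothing.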

Many of the standardized tests being used can be considered reliable, experts say. But reliability alone doesn't mean much, said Bob Schaeffer, public education director of the National Center for Fair and Open Testing, a nonprofit group that advocates against standardized testing.

"If you got on a scale, and every time you got on, it said it was 237 pounds, it would be reliable, even if you weighed 120," he said. "You could rely on it to say 237 pounds. But it's not accurate or meaningful."

And that's where the problem with validity comes into play, some educators say.

Broadly, experts say, a valid test is one that measures what its authors say it will measure. Tests assess children in many different areas; validity is all about the specific purpose of the test.

"A test itself is not valid or invalid," said Daniel Koretz, a professor of education at Harvard University. "The conclusion you base on the result is valid or invalid."

That means, for example, that under the standard of validity:

· A test designed to screen students for learning disabilities is not used to measure student progress in reading acquisition.

· A test that says it predicts college performance actually does. The old SAT said it did, but experts said the test had limited ability to predict a student's performance in the first year and none beyond that. The test has been changed and, experts say, is not intended to predict achievement.

· A test is not used to guide curriculum.

"There has been an explosion of mandates for more and more standardized tests with very little evidence to support their use," said Walter Haney of Boston College's Center for the Study of Testing, Evaluation and Educational Policy.

The No Child Left Behind program has ushered in an unprecedented era of high-stakes standardized testing, which has dramatically changed what goes on in classrooms across the United States and caused fierce debate over the approach.

The issue of what the tests actually measure has become more important than ever because the results do, indeed, have high stakes, with jobs of teachers and administrators sometimes riding on the single administration of a test. Many experts say that, in this environment, there should be much more effort to ensure that tests are valid.

"If indeed in the long run No Child Left Behind and the accountability movement is going to really have traction in improving education for kids in the United States, I think it's going to have to subject itself to a serious level of scrutiny," said Robert Pianta, director of the University of Virginia's Center for Advanced Study of Teaching and Learning.

What does validity actually mean in the context of student testing?

Testing experts generally refer to three major areas of validity:

· Content validity deals with, not surprisingly, content. A key component, curricular validity, demands that a test actually cover material in the curriculum (especially important in high school graduation tests).

· Criterion-related validity includes predictive validity. Gerald Bracey, an educational researcher and author of "Reading Educational Research: How to Avoid Getting Statistically Snookered," said that he does not know of any state that has tried to validate its tests against what happens in the future.

· Construct validity deals with the broad picture of whether a test assesses exactly what it is intended to measure; a science test trying to measure knowledge of geologic time might have questions that are so difficult to understand that what really is being measured is vocabulary and reading skills.
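
Checking predictive validity, as described in the criterion-related item above, usually comes down to comparing test scores with a later outcome and seeing how closely the two track. A minimal sketch of that comparison, using a Pearson correlation, follows. The function name and all the numbers are invented for illustration; they are not drawn from any state's test or from the experts quoted here.

```python
import math


def pearson_correlation(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a list has no spread")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: test scores and the same students' later grades.
    test_scores = [410, 480, 520, 560, 630]
    later_grades = [2.1, 2.9, 2.6, 3.2, 3.5]
    r = pearson_correlation(test_scores, later_grades)
    print(f"correlation between score and later grade: {r:.2f}")
```

A value near 1 would suggest the scores track the later outcome closely; a value near 0 would suggest the test says little about it, however reliable the scores are.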

Another form of validity, identified in the 1990s, is "consequential validity," which says that a test's validity is determined by how the results are used. It has the testing world in a verbal brawl because some experts think it is essentially nonsense.

"You can have a good test of, say, mathematics, and have school boards make ridiculous policy decisions based on the scores," Bracey said. "To me, that says nothing about the test."

Complicating matters, educators say, is the fact that the pipeline of newly trained testing experts charged with improving standardized tests is nowhere close to keeping pace with the skyrocketing demand.

Training started falling 25 years ago, and there has been no big resurgence. And the capacity of the commercial sector to produce the vastly increased number of tests has significantly lagged, experts say.

Roger Farr is director of the Center for Innovation and Assessment at Indiana University, a special consultant on testing and assessment to the education company Harcourt, and an author of several standardized tests. He said he thinks the country is placing too much emphasis on test results.

"Teach children to read and write well and the . . . tests will take care of themselves," he said. "What we've got to do is know what to teach kids. The goal of education is not coming up with answers. The goal of education is how you find answers."

© 2007 The Washington Post Company