Answer Sheet
Posted at 04:00 AM ET, 11/04/2011

NAEP: A flawed benchmark producing the same old story

This was written by James Harvey, executive director of the National Superintendents Roundtable. Harvey, who helped write the seminal 1983 report “A Nation at Risk,” is the author or co-author of four books and dozens of articles on education and has been examining the history of NAEP as part of his doctoral studies at Seattle University.

By James Harvey

The latest results from the National Assessment of Educational Progress were released this week and can be summarized quickly: New NAEP numbers tell the same old story. Fourth- and eighth-grade students have inched ahead in mathematics but only about one third score at the proficient or higher level in reading.

Proficiency remains a tough nut to crack for most students, in all subjects, at all grade levels. NAEP routinely reports that only one third of American students are proficient or better, no matter the subject, the age of the students, or their grade level. But no one should be surprised.

NAEP’s benchmarks, including the proficiency standard, evolved out of a process only marginally better than throwing darts at the wall.

That’s a troubling conclusion to reach in light of the expenditure of more than a billion dollars on NAEP over 40-odd years by the U.S. Department of Education and its predecessors. For all that money, one would expect that NAEP could defend its benchmarks of Basic, Proficient, and Advanced by pointing to rock-solid studies of the validity of its benchmarks and the science underlying them. But it can’t.

Instead, NAEP and the National Assessment Governing Board that promulgated the benchmarks have spent the better part of 20 years fending off a consensus in the scientific community that the benchmarks lack validity and don’t make sense. Indeed, the science behind these benchmarks is so weak that Congress insists that every NAEP report include the following disclaimer: “NCES [National Center for Education Statistics] has determined that NAEP achievement levels should continue to be used on a trial basis and should be interpreted with caution” (emphasis added).

Proficient Doesn’t Mean Proficient

Oddly, NAEP’s definition of proficiency has little or nothing to do with proficiency as most people understand the term. NAEP experts think of NAEP’s standard as “aspirational.” In 2001, two experts associated with NAEP’s National Assessment Governing Board (Mary Lynne Bourque, staff to the governing board, and Susan Loomis, a member of the governing board) made it clear that:

“[T]he proficient achievement level does not refer to ‘at grade’ performance. Nor is performance at the Proficient level synonymous with ‘proficiency’ in the subject. That is, students who may be considered proficient in a subject, given the common usage of the term, might not satisfy the requirements for performance at the NAEP achievement level.”

Far from supporting the NAEP “proficient” level as an appropriate benchmark for student accomplishment, many analysts endorse the NAEP “basic” level as the appropriate standard.

Criticisms of the NAEP Achievement Levels

What is striking in reviewing the history of NAEP is how easily and frequently its governing board has shrugged off criticisms about the board’s standards-setting processes.

In 1993, the National Academy of Education argued that NAEP’s achievement-setting processes were “fundamentally flawed” and “indefensible.” The General Accounting Office in 1993 concluded that “the standard-setting approach was procedurally flawed, and that the interpretations of the resulting NAEP scores were of doubtful validity.”

The governing board was so incensed by a report it received from Western Michigan University in 1991 that it looked into refusing to pay the university’s prominent assessment experts before hiring others to take issue with the report’s conclusions.

The governing board absorbed savage criticism from the National Academy of Sciences in 1999. Six years after the National Academy of Education report, the National Academy of Sciences concluded that:

“NAEP’s current achievement level setting procedures remain fundamentally flawed. The judgment tasks are difficult and confusing; raters’ judgments of different item types are internally inconsistent; appropriate validity evidence for the cut scores is lacking; and the process has produced unreasonable results.”

In fact, reported the National Academy of Sciences panel, “the results are not believable” largely because the NAEP results flew in the face of other evidence. Too few students were judged to be advanced, thought the panel, when measured against other indicators of advanced work, such as completion of calculus or participation in Advanced Placement.

Fully 50% of 17-year-olds judged to be only basic by NAEP ultimately obtained four-year degrees. Just one third of American fourth graders were said to be proficient in reading by NAEP in the mid-1990s at the very time that international assessments of fourth-grade reading judged American students to rank Number Two in the world.

For the most part, such pointed and critical comments from eminent authorities in the assessment field have rolled off the governing board and NAEP like so much water off a duck’s back.

As recently as late 2009, the U.S. Department of Education received a report on NAEP that it had commissioned from the Buros Institute at the University of Nebraska. The institute is named after Oscar Krisen Buros, the founding editor of Mental Measurements Yearbook. The report noted, “Validity is the most fundamental consideration in developing and evaluating tests.”

The Institute then took NAEP to task for, among other things, lacking a “validity framework,” ignoring any program of organized validation research, unprofessionally releasing technical reports years after NAEP results had been announced to the public, and the fact that “notably absent [are] clearly defined intended uses and interpretations of NAEP.” The Institute went on to recommend:

“… [a] transparent, organized validity framework, beginning with a clear definition of the intended and unintended uses of the NAEP assessment scores. We recommend that NAGB continue to explore achievement level methodologies…. [W]e further recommend that NAGB consider additional sources of external validity [such as] ACT or SAT scores…and transcript studies…to strengthen the validity argument.”

In short, for the last 20 years it has been hard to find any expert not on the U.S. Department of Education’s payroll who will accept the NAEP benchmarks uncritically.

NAEP and International Assessments

The NAEP benchmarks might be more convincing if most students elsewhere could handily meet them. But that’s a hard case to make, judging by a 2007 analysis from Gary Phillips, former acting commissioner of NCES. Phillips set out to map NAEP benchmarks onto international assessments in science and mathematics.

Only Taipei and Singapore have a significantly higher percentage of “proficient” students in eighth grade science (by the NAEP benchmark) than the United States. In math, the average performance of eighth-grade students could be classified as “proficient” in six jurisdictions: Singapore, Korea, Taipei, Hong Kong, Japan, and Flemish Belgium. It seems that when average results by jurisdiction place typical students at the NAEP proficient level, the jurisdictions involved are typically wealthy — many with “tiger mothers” or histories of not enrolling low-income students or those with disabilities.


Complexity and Judgment

None of this is to say that the NAEP achievement levels are entirely indefensible. Like other large-scale assessments (the Trends in International Mathematics and Science Study, the Progress in International Reading Literacy Study, and the Programme for International Student Assessment), NAEP is an extremely complex endeavor, depending on procedures in which experts make judgments about what students should know and be able to do and construct assessment items to distinguish between student responses. Panels then make judgments about specific items, and trained scorers, in turn, bring judgment to bear on constructed-response items, which typically make up about 40 percent of NAEP items.

In summary, three important facts about NAEP have been downplayed, ignored, or swept under the rug and need to be acknowledged and addressed.

First, NAEP’s achievement levels, far from being engraved on stone tablets, are administered, as Congress insists, on a “trial basis.” Second, the NAEP achievement levels are inherently based on judgment and not science. While it is not entirely fair to say that this is little better than throwing darts at the wall, it is fair to say that this is little better than educated guesswork. Third, the proficiency benchmark seems reachable by most students in only a handful of wealthy or Asian jurisdictions.

Enough questions exist about these achievement levels that Congress should commission an independent exploration to make sense in straightforward language of the many diverse definitions of proficiency found in state, national, and international assessments. A national assessment that puts proficiency beyond the reach of students throughout the Western world and most of Asia promises not to clarify our educational challenges but to confuse them.

-0-


    © 2011 The Washington Post Company