The National Assessment of Educational Progress is often referred to as “the nation’s report card” or the “gold standard” in student assessment because it is seen as the most consistent nationally representative measure of U.S. student achievement since the 1990s and because it is supposed to be able to assess what students “know and can do.”

It is administered every two years to groups of U.S. students in the fourth and eighth grades, and less frequently to high school students. (The test-takers are said to be randomly chosen within selected schools.) Tests are given every two years in math and reading and less frequently in science, writing, the arts, civics, economics, geography, technology and engineering literacy, and U.S. history.

NAEP results are highly anticipated in the education world and often seen as a benchmark for progress in school systems, even though the results are often misinterpreted. When students score at the “proficient” level on NAEP, many take that to mean they are “proficient” at their grade level, but that isn’t the case, and the mistake makes for bad analysis of the results.

NAEP supporters say that the tests are able to measure skills that other standardized tests can’t: problem solving, critical thinking, etc. But this post takes issue with that notion. It was written by three Stanford University academics who are part of the Stanford History Education Group: Sam Wineburg, Mark Smith and Joel Breakstone.

Wineburg, an education and history professor in the Graduate School of Education, is the founder and executive director of the Stanford History Education Group and Stanford’s PhD program in education history. His research interests include assessment, civic education and literacy. Smith, a former high school social studies teacher in Iowa, Texas and California, is the group’s director of assessment; his research is focused on K-12 history assessment, particularly on issues of validity and generalizability. And Breakstone, a former high school history teacher in Vermont, directs the Stanford History Education Group. His research focuses on how teachers use assessment data to inform instruction.

The three led the development of the group’s assessment website, Beyond the Bubble.

By Sam Wineburg, Mark Smith and Joel Breakstone

It’s known as the “nation’s report card.” The National Assessment of Educational Progress (NAEP) is a gauge of achievement in a country suspicious of federal intrusions into education. Not only has NAEP survived suspicion, but it has seen its budget swell from less than half a million dollars in 1968 to last year’s appropriation of $149 million.

NAEP has powerful backers. Sen. Lamar Alexander (R-Tenn.), who chairs the Senate Education Committee and who was secretary of education under President George H.W. Bush, touts NAEP as the “most reliable instrument in elementary and secondary education,” essential in “history, civics, and geography,” subjects often ignored by individual state assessments.

Students have never fared well on NAEP’s tests in these subjects. The first history test in 1987 found that half of the students couldn’t place the Civil War in the right half-century. Some 15 years later, following a decade of new standards, The Washington Post wrote that students on the 2001 exam “lack even a basic knowledge of American history.” In 2014, the last time history was tested, the New York Times fished into the recycling bin for this headline: “Most Eighth-Graders Score Low on History, Civics.”

But what would happen if instead of grading the kids, we graded the test makers? How? By evaluating the claims they make about what their tests actually measure.

For example, in history, NAEP claims to test not only names and dates, but critical thinking — what it calls “Historical Analysis and Interpretation.” Such questions require students to “explain points of view,” “weigh and judge different views of the past,” and “develop sound generalizations and defend these generalizations with persuasive arguments.” In college, students demonstrate these skills by writing analytical essays in which they have to put facts into context. NAEP, however, claims it can measure such skills using traditional multiple-choice questions.

We wanted to test this claim. We administered a set of Historical Analysis and Interpretation questions from NAEP’s 2010 12th-grade exam to high school students who had passed the Advanced Placement (AP) exam in U.S. History (with a score of 3 or above). We tracked students’ thinking by having them verbalize their thoughts as they solved the questions.

What we learned shocked us.

In a study that appears in the forthcoming American Educational Research Journal, we show that in 108 cases (27 students answering four different items), there was not a single instance in which students’ thinking resembled anything close to “Historical Analysis and Interpretation.” Instead, drawing on canny test-taking strategies, students typically did an end run around historical content to arrive at their answers.

One of NAEP’s “Analysis and Interpretation” questions on the 2010 test was about the 14th Amendment to the Constitution:

All persons born or naturalized in the United States … are citizens of the United States and of the State wherein they reside. No State shall make or enforce any law which shall abridge the privileges or immunities of citizens of the United States; nor shall any State deprive any person of life, liberty, or property, without due process of law; nor deny to any person … equal protection of the laws.

Q: This amendment has been most important in protecting the

a) right of communities to control what goes on in their schools
b) rights of foreigners living in the United States
c) rights of individual citizens of the United States
d) right of the government to keep secrets for reasons of national security

If students had analyzed or interpreted, they would have spoken about the context surrounding the 14th Amendment — the fury of white Southerners at the passage of the 13th Amendment abolishing slavery. Or they would have written about the institution of “Black Codes” to limit the rights of former slaves or the refusal of Southern states to ratify the amendment. From 17-year-old Jonathan, we heard little about context and a lot about test taking:

The question talks about persons born or naturalized in the U.S., and they’re talking about their rights. So (b) is pretty clearly not right because that talks about foreigners, which would apply to somebody who is visiting and probably not naturalized … Then (a) and (d) are talking about either communities controlling their schools or government keeping secrets for national security. Those things just aren’t even addressed in the text at all. And then (c) is the last one left, and rights of individual citizens is definitely hit on in this, so that one makes the most sense.

Asked how he got to the right answer, Jonathan expounded on what astute test takers do when the gauge points to empty: “You just sort of ‘logic it out’ from the things you’re given instead of actually having to know what the 14th Amendment is.” The only “analysis” here is how to psych out a multiple-choice test.

Another “Analysis and Interpretation” question asked about Shays’s Rebellion, in which Daniel Shays led a crew of disgruntled Revolutionary War veterans on a raid of the federal arsenal in Springfield, Mass. Sixteen-year-old Jenna taught us that it’s even possible to get the right answer by plunking down historical events in the middle of the wrong century.

Q: Shays’ Rebellion (1786) was important because it

a) led many people to believe that the central government was too weak
b) led to the end of public support for the First Bank of the United States
c) made many people fear the tyranny of the President more than the tyranny of England
d) convinced many people in the North that slavery should be expanded to new territories

Most of the 27 students solved the question by recalling facts straight from their AP course. Jenna instead boomeranged between the 17th and 19th centuries before finally landing in the middle of the Civil War:

Shays’s Rebellion, I get mixed up with Bacon’s Rebellion. I think it’s either about slavery or about the government. I think it’s either (A) or (D). I think it was to stop slavery because the slaves were the people, like, trying to stop it. So, I don’t think it’s to expand slavery (D). So not (C) or (D). I think it’s either (A) or (B), but I don’t remember much about the First Bank of the United States. I don’t think it was very popular. I know that the South didn’t have a good central government. So, maybe, yeah, I think it’s (A).

In this temporal hall of mirrors, Shays’s Rebellion results from the failed policies that Jefferson Davis brought to Richmond in 1861, which led Southerners to “believe that the central government was too weak.” The optical scanner reading Jenna’s “correct” response would never know the difference.

Frederick J. Kelly invented the first multiple-choice questions in 1914, the same year Archduke Franz Ferdinand was gunned down in Sarajevo. Kelly’s items look eerily familiar to anyone who’s ever sat in an American classroom:

“Below are given the names of four animals. Draw a line around the name of each animal that is useful on the farm: cow tiger rat wolf.”

But it wasn’t until the 1930s that this new form of testing really took off. A physics teacher from Ironwood, Mich., used the electric current transmitted by a pencil’s graphite to create what he called the “lazy teacher’s gimmick,” the Markograph. Once Reynold B. Johnson sold the rights to his machine to IBM for $15,000 in 1934, testing would never be the same.

In the years since, bubble tests have become a fixture of American schooling. We’ve convinced ourselves that they can do everything — even measure sophisticated thinking processes such as “Historical Analysis and Interpretation.” As our study suggests, confidence in the multiple-choice question as a one-stop shop for all kinds of assessment is misplaced.

This year, NAEP will retire the No. 2 pencil and go paperless. Students will soon use 21st-century digital technology to answer multiple-choice questions that, in their basic form, haven’t changed much since Kelly administered the “Kansas Silent Reading Test.”

In the intervening century, technology has transformed how we shop, bank, choose a mate — even how we arrange care for a sick relative. But one thing remains unchanged: We’re still teaching kids the lesson that problem solving is about plucking a single answer from four orderly alternatives.

But orderly is exactly what their world is not. When our future voters inform themselves about that world, they do it on their phones, flipping through their Facebook, Snapchat and Twitter feeds. Never have the critical skills historians associate with “Historical Analysis and Interpretation” been more crucial to teach, and to assess.

In announcing the move to a paperless NAEP, William Bushaw, executive director of its governing board, hailed the fact that the test could be delivered digitally, calling NAEP “the gold standard of student achievement.” But as long as NAEP claims to measure critical thinking using an item format designed to see if kids can tell the difference between a cow and a rat, it’s no gold standard.

It’s fool’s gold.