Mayfield High School junior Laura Cruz holds a sign in 2015 during a student protest of the PARCC Common Core exams in Las Cruces, N.M. (Robin Zielinski/AP/Las Cruces Sun-News)

If you think that determining scores on standardized tests is a simple matter of figuring out how many answers each student got right, you are wrong. In fact, scores are derived through statistical models and scaling practices that can be misleading about student achievement — and this can have an effect on education policy, according to a newly released paper.

The study, titled “Student Test Scores: How the Sausage Is Made and Why You Should Care,” was written by Brian A. Jacob, a professor at the University of Michigan and a nonresident senior fellow at the nonprofit Brookings Institution in D.C. Jacob explains the sophisticated and complex way test scores are determined, then details why they mislead pretty much everybody except psychometricians.

This includes those federal and state policymakers who don’t understand how the scores are derived and/or their limitations, but who have nonetheless elevated standardized tests to an all-important measure of how much students have learned in school and how well their teachers are doing their jobs.

The paper is the latest in a series of reports over the years that have urged caution in using standardized test scores to make high-stakes decisions about students, teachers, principals and schools. Policymakers at the federal and state levels have largely ignored the warnings.

For example, in 2014, the American Statistical Association issued a statement saying that value-added modeling for teacher evaluation was unfair and invalid because value-added scores do not directly measure all potential teacher contributions and typically capture correlation, not causation. In the past year or so, there has been some recognition among policymakers that students are taking too many high-stakes tests and that there are problems with using student test scores to evaluate educators, but some states still use value-added as an evaluation method.

In his paper, Jacob questions policy that relies heavily on test scores, which depend on various decisions made by test designers. For example, he said that the simple choice of which of three “parameter” models to use in modern test development can “make a sizable difference for extremely high- or low-performing students.”
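To make the stakes of that modeling choice concrete, here is a small sketch in Python of the logistic item response models that underlie modern tests. The parameter values below are purely illustrative, not drawn from Jacob's paper or any real exam; the point is that a one-parameter and a three-parameter model can sit close together near average ability while diverging sharply for the lowest performers, where the guessing parameter matters most:

```python
import math

def irt_probability(theta, a=1.0, b=0.0, c=0.0):
    """Probability that a student of ability theta answers an item correctly.

    a = discrimination, b = difficulty, c = guessing floor.
    With a=1 and c=0 this is the one-parameter (Rasch) model; freeing a
    gives the two-parameter model, and freeing c as well gives the
    three-parameter model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# The same item under one-parameter vs. three-parameter assumptions
# (all numbers are illustrative):
for theta in (-3.0, 0.0, 3.0):
    p1 = irt_probability(theta)                       # 1PL
    p3 = irt_probability(theta, a=1.5, b=0.0, c=0.2)  # 3PL with guessing
    print(f"ability {theta:+.1f}: 1PL={p1:.2f}  3PL={p3:.2f}")
```

For a student three standard deviations below average, the two models disagree substantially about the chance of a correct answer, which is exactly the region where Jacob says the choice of model makes a sizable difference.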

In recent years, many districts have started to use measures of teacher value-added as part of their decisions about promotion, tenure and even compensation. A teacher’s “value-added” is based on how much improvement his or her students make on standardized tests during the school year (sometimes adjusted for various demographic characteristics). A teacher whose students grew by, say, 15 points is considered more effective than a teacher whose students grew by only 10 points. However, if the students in these classrooms started from different baselines, then this type of comparison depends entirely on the scaling of the exam. For example, it might be the case that a teacher who raises the scores of low-achieving students by 10 points has actually taught those students more than her colleague who manages to raise the scores of higher-achieving students by 15 points.
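The comparison above can be sketched in a few lines of Python. The scores and the alternative scale below are hypothetical, chosen only to show that a monotone rescaling of the same exam can reverse which teacher looks more effective:

```python
import math

def raw_gain(pre, post):
    return post - pre

def alt_gain(pre, post):
    # A hypothetical alternative (concave) scale: equal raw points are
    # assumed to represent more learning at the bottom of the
    # distribution than at the top.
    scale = lambda s: 40 * math.log(s + 1)
    return scale(post) - scale(pre)

# Teacher A's low-achieving students:    20 -> 30 (a 10-point raw gain)
# Teacher B's higher-achieving students: 70 -> 85 (a 15-point raw gain)
raw_a, raw_b = raw_gain(20, 30), raw_gain(70, 85)
alt_a, alt_b = alt_gain(20, 30), alt_gain(70, 85)

print(f"raw scale:         A={raw_a}, B={raw_b}")          # B looks better
print(f"alternative scale: A={alt_a:.1f}, B={alt_b:.1f}")  # A looks better
```

Nothing about the students changed between the two printouts; only the scale did, which is why Jacob argues that value-added rankings depend on scaling decisions most users never see.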

This problem is becoming more apparent to more people. The Obama administration has been a big supporter of using student standardized test scores to evaluate educators, but the 2016 platform of the Democratic Party says, for the first time ever, that using test scores for high-stakes decisions is a bad practice:

We oppose high-stakes standardized tests that falsely and unfairly label students of color, students with disabilities and English Language learners as failing; the use of standardized test scores as a basis for refusing to fund schools or to close schools; and the use of student test scores in teacher and principal evaluations, a practice that has been repeatedly rejected by researchers.

Jacob wrote:

Contrary to popular belief, modern cognitive assessments — including the new Common Core tests — produce test scores based on sophisticated statistical models rather than the simple percent of items a student answers correctly. While there are good reasons for this, it means that reported test scores depend on many decisions made by test designers, some of which have important implications for education policy. For example, all else equal, the shorter the length of the test, the greater the fraction of students placed in the top and bottom proficiency categories — an important metric for state accountability. On the other hand, some tests report “shrunken” measures of student ability, which pull particularly high- and low-scoring students closer to the average, leading one to understate the proportion of students in top and bottom proficiency categories. Shrunken test scores will also understate important policy metrics such as the black-white achievement gap — if black children score lower on average than white children, then scores of black students will be adjusted up while the opposite is true for white students.

The scaling of test scores is equally important. Despite common perceptions, a 5-point gain at the bottom of the test score distribution may not mean the same thing in terms of additional knowledge as a 5-point gain at the top of the distribution. This fact has important implications for the value-added based comparisons of teacher effectiveness as well as accountability rankings of schools. There are no easy solutions to these issues. Instead there must be greater transparency of the test creation process, and more robust discussion about the inherent trade-offs about the creation of test scores, and more robust discussion about how different types of test scores are used for policymaking as well as research.
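Jacob's shrinkage point can be illustrated with a minimal sketch, assuming a simple reliability-weighted pull toward the group mean (the scores and the 0.7 reliability value are made up for illustration):

```python
def shrink(score, group_mean, reliability):
    """Pull an observed score toward the group mean.

    reliability is a weight in (0, 1) on the observed score; a noisier
    (e.g., shorter) test has lower reliability and thus more shrinkage.
    """
    return reliability * score + (1 - reliability) * group_mean

scores = [5, 20, 50, 80, 95]      # illustrative scale scores
mean = sum(scores) / len(scores)  # 50
shrunk = [shrink(s, mean, 0.7) for s in scores]
print(shrunk)  # tail scores move toward the center, so fewer students
               # land in the top and bottom proficiency bands

# If one group averages 40 and another averages 60, shrinking both
# toward the overall mean of 50 understates the true 20-point gap:
measured_gap = shrink(60, 50, 0.7) - shrink(40, 50, 0.7)
print(measured_gap)  # about 14 points instead of 20
```

This is the mechanism behind Jacob's warning that shrunken scores understate both the share of students in the extreme proficiency categories and between-group achievement gaps.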

You can read the whole paper here.