The study is titled “Predictive Validity of MCAS and PARCC: Comparing 10th Grade MCAS Tests to PARCC Integrated Math II, Algebra II, and 10th Grade English Language Arts Tests,” and the authors discussed their work in an article in the newest edition of the Education Next journal. The article says:
Ultimately, we found that the PARCC and MCAS 10th-grade exams do equally well at predicting students’ college success, as measured by first-year grades and by the probability that a student needs remediation after entering college. Scores on both tests, in both math and English language arts (ELA), are positively correlated with students’ college outcomes, and the differences between the predictive validity of PARCC and MCAS scores are modest. However, we found one important difference between the two exams: PARCC’s cutoff scores for college- and career-readiness in math are set at a higher level than the MCAS proficiency cutoff and are better aligned with what it takes to earn “B” grades in college math. That is, while more students fail to meet the PARCC cutoff, those who do meet PARCC’s college-readiness standard have better college grades than students who meet the MCAS proficiency standard.
Here’s a post challenging the results of the study, by William J. Mathis, managing director of the National Education Policy Center at the University of Colorado in Boulder, a member of the Vermont Board of Education and a former Vermont superintendent. The views expressed here are his own and do not reflect the views of any group with which he is associated. Following the post is a response from Mathematica.
By William J. Mathis
“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean — neither more nor less.” — “Through the Looking Glass,” by Lewis Carroll
The PARCC tests have been criticized for being administered in high-stakes circumstances before they were validated. PARCC’s rejoinder is that the tests had content validity, meaning that they were built according to committee-reviewed specifications. But what is missing is predictive validity. That is, does the test validly measure the much-vaunted touchstone criterion of “College and Career Ready?” After all, that is the entire rationale for the testing emphasis in schools.
Unfortunately, both the PARCC and SBAC tests were administered to the nation’s schoolchildren without a single empirical study demonstrating the tests actually had the predictive capability they claimed.
This barren landscape changed with the release of a report by Mathematica in October. It didn’t get much attention, perhaps because it was clouded by a haze of acronyms and a comparison of PARCC tests with Massachusetts state tests. There it lay until the study was covered by a pro-testing think tank. This prompted PARCC CEO Laura Slover to call attention to this lonely and neglected report.
Hyping the validity of the tests, Ms. Slover tweeted:
She then went on to quote Mathematica’s Ira Nichols-Barrer: “It’s a strong signal that in terms of that aspect of what PARCC was designed to do – to give a strong indication of college readiness – it succeeded in doing that.”
These claims were based on a comparison study of PARCC, the Massachusetts MCAS test and college freshman grade point average (GPA). Unfortunately, a casual look at the Mathematica “predictive validity” study shows that the PARCC is simply not a valid measure of “college and career readiness.” Let’s examine the words of the report itself:
These correlations between test scores and GPA are modest in size, ranging from 0.07 to 0.40. The highest correlations are found among the math tests; for instance, the correlation . . . between math GPA and PARCC (math scores) are 0.37 and 0.40, respectively. The correlations between the ELA tests and adjusted ELA GPA are lower, ranging from 0.13 to 0.26. (p. 11).
Comparing college GPA, the measure of “college readiness,” with the test scores, the variables were statistically correlated. Correlation coefficients run from -1.0 (perfect inverse relationship) through zero (no relationship) to +1.0 (perfect relationship). The proportion of variance in one measure predicted by another is the square of the correlation coefficient. For instance, squaring the highest coefficient (0.40) gives us 0.16. This means the PARCC tests predicted 16 percent of first-year college GPA; we have no idea what accounted for the missing 84 percent. If we take the lowest coefficient reported above (0.07), then only one half of one percent of the variance is predicted. That leaves 99.5 percent unaccounted for. The report calls these coefficients “modest.” That’s a Humpty-Dumpty case of saying words mean what you want them to mean. “Weak” or “non-existent” might fit better.
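The arithmetic here is simple enough to check in a few lines of Python (a quick illustrative sketch of the r-squared calculation, not code from the study):

```python
# A correlation coefficient r predicts variance only in proportion to r^2,
# the coefficient of determination.
def variance_explained(r: float) -> float:
    """Fraction of variance in one measure predicted by the other."""
    return r ** 2

# Highest math correlation reported in the Mathematica study:
print(round(variance_explained(0.40) * 100, 1))  # 16.0 percent predicted
# Lowest reported correlation:
print(round(variance_explained(0.07) * 100, 1))  # 0.5 percent predicted
```

Note how quickly the predicted share collapses: halving the correlation quarters the variance explained.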
That the study sample consisted of college freshmen taking high school tests suggests the actual coefficients would be even lower than these unimpressive results if we followed real high school students for three years.
In what Mathematica labels “key findings,” the report says,
“Both the MCAS and the PARCC predict college readiness. Scores on the assessments explain about 5 to 18 percent of the variation in first-year college grades…” [Emphasis in the original]
This leads the reader into giving undue weight to a (very weak) prediction. With just as much accuracy, they could have said, “the PARCC test cannot explain between 82 and 95 percent of the variation in college GPA. Thus, it cannot validly predict college readiness.”
At this point, the reader might be wondering whether these down-in-the-dirt statistics are acceptable in the weedy world of psychometrics. A tour through the literature shows that predictive validity coefficients are quite low in general and commonly run in the 0.30s. One conclusion is that the PARCC is just about as good as any other test, which is the report’s finding with regard to the MCAS. But the more correct conclusion is that standardized tests can predict scores on other standardized tests (which this report confirms) while they cannot validly predict college readiness at any meaningful level.
Perhaps in trying to explain these numbers, the report says, “Predictive validity is only one consideration that is relevant to the state’s selection of an assessment system.” Yet, in a standards-based world, this is an essential, non-negotiable consideration and our test makers clearly didn’t meet standards.
With such low predictability, there are huge numbers of false positives and false negatives. When connected to consequences, these misses have a human price. This goes further than being a validity question: it misleads young adults, wastes resources and misjudges schools. It is not just a technical issue; it is a moral question. Until these tests are proven valid for their intended purpose, they should not be used in high-stakes contexts.
This gets us back to the meaning of the word, validity.
“The question is,” said Alice, “whether you can make words mean so many different things.” “The question is,” said Humpty Dumpty, “which is to be master — that’s all.” — “Through the Looking Glass,” by Lewis Carroll
Here’s the response from Mathematica, written by Ira Nichols-Barrer and Brian Gill at Mathematica Policy Research:
We appreciate Mr. Mathis’ interest in our study. The study was commissioned by the Commonwealth of Massachusetts to examine whether the PARCC exam outperformed the state’s existing standardized test (the MCAS) in predicting students’ success in college, to inform a choice between the two assessments that had to be made last fall. In our view this is an encouraging example of evidence-based policymaking working the way it should: the state faced a decision about which testing system to use, and commissioned independent, rigorous research to inform the decision.
Mr. Mathis is correct in indicating that most of the variation in college grades is not explained by the test scores. That is true of any test. As we note in our report, the PARCC does as well as the SAT and as well as the pre-existing Massachusetts assessment, which was widely regarded as one of the best state assessments. Whether this is good enough is a matter worthy of debate, but it was not what the study was designed to determine. We certainly agree that there is room for improvement in the assessments.
The statement quoted by Mr. Mathis (and previously by Ms. Slover) was referring to the cut scores set by the PARCC as thresholds for college readiness. The study showed that PARCC achieved its goal of identifying a threshold where a student would have at least a 75 percent chance of earning a “C” grade in college (students earning college-ready scores on the PARCC had an 85 percent chance of earning a C or better in math and an 89 percent chance of doing so in English Language Arts). The study also found that the PARCC college-ready thresholds correspond to a predicted grade of “B” in college courses. In mathematics, we found that the PARCC’s college-ready threshold was better aligned with “B” grades than was the MCAS threshold. This is useful information for parents, students, and policymakers.
Nonetheless, Mr. Mathis is also correct that the correlations are low enough that many students (and parents, and colleges) would overestimate or underestimate their true college readiness if they relied only on the test score to make the judgment. Fortunately, students have lots of other information available to inform their judgments alongside the test scores (most importantly, their high school grades). We wouldn’t recommend that anyone rely exclusively on the test score for high-stakes decisions.
And here is a response from Mathis:
Factually, Messrs. Nichols-Barrer and Gill and I are in agreement. Their purpose was to compare MCAS with PARCC. However, the byproduct of their analysis provided a quite valuable contribution. That is, they produced a set of correlations that showed the PARCC “College and Career Ready” tests failed to predict “college readiness” effectively.
If tests such as PARCC (or SBAC) do not measure the very attributes they were designed to measure, this calls into question the validity of test-based reforms. Messrs. Nichols-Barrer and Gill are quite correct in saying standardized tests do not measure these things very well.
Thus, using them in a high-stakes or consequential environment would not be defensible.