FiveThirtyEight is a blog created by Nate Silver, a famous statistician who developed a system for forecasting player performance in Major League Baseball and accurately predicted the winner of 49 out of 50 states in the 2008 presidential election. A few years ago, he was asked during an online Q & A session on Reddit whether he believes student standardized test scores should be used to evaluate teachers. His response went like this:
There are certainly cases where applying objective measures badly is worse than not applying them at all, and education may well be one of those. In my job out of college as a consultant, one of my projects involved visiting public school classrooms in Ohio and talking to teachers, and their view was very much that teaching-to-the-test was constraining them in some unhelpful ways. But this is another topic that requires a book- or thesis-length treatment to really evaluate properly. Maybe I’ll write a book on it someday.
He hasn’t written the book yet, or published anything at length on the topic since. But now, Andrew Flowers, FiveThirtyEight’s quantitative editor, has published a piece on the blog about teacher evaluation with this headline, “The Science of Grading Teachers Gets High Marks.”
In the July 20 post — and in a response to questions I sent him — Flowers defends controversial “value-added modeling” (VAM) research conducted by three researchers that, among other things, predicts long-term student outcomes by using student standardized test scores to evaluate their teachers. One of those outcomes is future earnings, with the researchers predicting how much more students with “effective” teachers can earn vs. students with “ineffective” teachers — with effectiveness measured by VAM.
President Obama cited the research in his 2012 State of the Union address by saying, “We know a good teacher can increase the lifetime income of a classroom by over $250,000.” And a judge in the controversial Vergara v. California case referred to it as evidence to assert that “a single year in a classroom with a grossly ineffective teacher costs students $1.4 million in lifetime earnings per classroom.” (In Vergara, nine students claimed they got a bad education and blamed it on teacher tenure; the judge tossed out state statutes giving job protections to teachers but stayed his ruling pending appeal.)
That research has been scrutinized and criticized by a number of education researchers. Flowers noted some of the common criticisms of VAM in his post, but said the specific research about which he writes is “cutting-edge” and avoids the pitfalls of other value-added methods. (You can see his full response, plus a VAM expert’s comments, below.)
Value-added modeling (or measurement) is said by adherents to be able to take student standardized test scores and measure the “value” of a teacher by factoring out all of the other influences on student performance (such as hunger, sickness, grief, trauma, etc.). Critics say that the formulas can’t do that for individual teachers and that they are unreliable and invalid for use in teacher evaluations. In some states, such as Florida and New York, 50 percent of a teacher’s evaluation is based on student standardized test scores.
Under the theory that just about everything can be reduced to numbers, the idea of evaluating teachers by student test scores was advanced years ago by economists, who have become the arbiters of teacher assessment. President Obama, whom many supporters thought would reduce the importance of standardized test scores, did the opposite. His Education Department supported value-added measurement and required states to evaluate individual educators in part by student standardized test scores in exchange for federal funding through his Race to the Top competition, or for waivers from the most egregious parts of No Child Left Behind.
Teachers have become highly suspicious of value-added scores. Currently only two subjects — math and English language arts — are tested with “accountability” exams, and as a result, unusual implementation methods have been devised to cover teachers in non-tested subjects. Most teachers assessed this way are evaluated by the test scores of students they don’t have or subjects they don’t teach. (Really.) How does this work?
Sometimes, school test averages are factored into all teachers’ evaluations. Sometimes, certain groups of teachers are attached to either reading or math scores; social studies teachers, for example, are more often attached to English Language Arts scores while science teachers are attached to math scores. An art teacher in New York City explained in this post how he was evaluated on math standardized test scores, and saw his evaluation rating drop from “effective” to “developing.” In Indian River County, Fla., an English Language Arts middle school teacher named Luke Flynt said that the highest-scoring students wound up hurting his evaluation because of the peculiarities of VAM modeling. (See more about Flynt’s case below.)
Flowers’ post on FiveThirtyEight discusses what he calls a “cordial debate” between economists about the research mentioned above. He writes:
On one side is Raj Chetty of Harvard University, John Friedman of Brown University and Jonah Rockoff of Columbia University — hereafter referred to as “CFR” — who authored two influential papers published last year in the American Economic Review; Chetty testified for the [Vergara] plaintiffs in the case. On the other side is Jesse Rothstein, of the University of California at Berkeley, who published a critique of CFR’s methods and supported the state in the Vergara case.
The post talks about the “cutting edge” research of Chetty/Friedman/Rockoff, who conducted their value-added analyses with a set of data that included more than 1 million student-level test and tax records. The study was supported by like-minded economists but also strongly critiqued by other researchers (including but not exclusively by Rothstein at Berkeley) who said, among other things, that the conclusions had been overdrawn, that what may be true in the aggregate may not apply to specific teachers because the influence of outside variables is too great, and that it is impossible to factor out every other influence on a student’s test performance to determine precisely how a teacher contributed.
Flowers ends his post by discussing conversations he had with two other economists, Thomas Kane of Harvard and Douglas Staiger of Dartmouth College, both VAM sympathizers, and writes:
“It’s almost like we’re doing real, hard science here,” [Brown University economist John] Friedman said. Well, almost. But by the standards of empirical social science — with all its limitations in experimental design, imperfect data, and the hard-to-capture behavior of individuals — it’s still impressive. The honest, respectful back-and-forth of dueling empirical approaches doesn’t mean the contentious nature of teacher evaluation will go away. But for what has been called the “credibility revolution” in empirical economics, it’s a win.
A win? For whom is it a win?
Certainly not for teachers and their students. Larry Ferlazzo, an award-winning veteran educator who teaches English and social studies at Luther Burbank High School in Sacramento, Calif., is one of the teachers who have expressed disappointment with the post, which you can see here.
Here are some other key problems with relying on VAM, even in part, to evaluate teachers:
*The quality of the underlying standardized assessment is assumed to be at least adequate — or why use the student scores to evaluate their teachers? — when, in fact, many of them are less than adequate to provide a well-rounded, authentic look at what students have learned and are able to do. The National Research Council in 2011 issued a report saying that standardized tests commonly used in schools to measure student performance “fall short of providing a complete measure of desired educational outcomes in many ways,” according to a summary. It is important to note that new accountability exams aligned to the Common Core State Standards, which were initially trumpeted as being far more sophisticated than the older tests, have not turned out to be the “game-changer” in assessment that Education Secretary Arne Duncan said they would. In fact, they still fall far short of being excellent assessments that can evaluate a wide range of student skills and abilities, according to Stanford University’s Linda Darling-Hammond.
*The post by Flowers suggests that VAM formulas are created and used with some degree of precision, when, in fact, many of them are plagued with mistakes.
Take the case of Flynt, who told his school board that through VAM formulas, each student is assigned a “predicted” score — based on past performance by that student and other students — on the state-mandated test. If the student exceeds the predicted score, the teacher is credited with “adding value.” If the student does not do as well as the predicted score, the teacher is held responsible and that score counts negatively towards his/her evaluation. He said four students had predicted scores that were “literally impossible” because those predicted scores were higher than the maximum number of points that can be earned on the exam. He said:
“One of my sixth-grade students had a predicted score of 286.34. However, the highest a sixth-grade student can earn is 283. The student did earn a 283, incidentally. Despite the fact that she earned a perfect score, she counted negatively toward my evaluation because she was 3 points below predicted.”
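The mechanic Flynt describes can be sketched in a few lines. Everything here is illustrative: real VAM formulas are far more elaborate regressions, but they share this structure — the predicted score is a model output that is not capped at the test’s maximum, so even a perfect score can count against the teacher. The numbers are taken from Flynt’s own example.

```python
# Illustrative sketch of the value-added mechanic Flynt describes.
# The simple subtraction stands in for a much more elaborate regression;
# the key point is that the predicted score is NOT clamped to the test's
# maximum possible score.

TEST_MAX = 283  # highest possible score on the sixth-grade reading FCAT

def value_added(actual: float, predicted: float) -> float:
    """Positive -> teacher 'added value'; negative counts against the teacher."""
    return actual - predicted

predicted = 286.34  # regression prediction, above the test ceiling
actual = TEST_MAX   # a perfect score

print(round(value_added(actual, predicted), 2))  # -3.34
```

Because the prediction is never bounded by `TEST_MAX`, a student predicted above the ceiling cannot possibly “add value,” no matter how well she performs.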
He also said that while almost half of his students who counted toward his VAM — 50 of 102 — fell short of their predicted score, the negative image that presents is misleading. He said:
Of the 50 students who did not meet their predicted score, 10 percent missed zero or one question, 18 percent missed two or fewer questions, 36 percent missed three or fewer questions, 58 percent missed four or fewer questions. Let me stop to explain the magnitude of missing four or fewer questions. Since the reading FCAT [the test that was given] contained 45 questions, a student who missed four or fewer would have answered at least 90 percent of the questions correctly. That means that 58 percent of the students whose performance negatively affected my evaluation earned at least 90 percent of the possible points on the FCAT.
For years now, education and other assessment experts have said VAM is not ready to be used for high-stakes evaluations of individual teachers. The American Statistical Association, the largest organization in the United States representing statisticians and related professionals, released a 2014 report on using VAM for educational assessment saying in part:
- VAMs are complex statistical models, and high-level statistical expertise is needed to develop the models and interpret their results.
- Estimates from VAMs should always be accompanied by measures of precision and a discussion of the assumptions and possible limitations of the model. These limitations are particularly relevant if VAMs are used for high-stakes purposes.
- VAMs are generally based on standardized test scores, and do not directly measure potential teacher contributions toward other student outcomes.
- VAMs typically measure correlation, not causation: Effects – positive or negative – attributed to a teacher may actually be caused by other factors that are not captured in the model.
- Under some conditions, VAM scores and rankings can change substantially when a different model or test is used, and a thorough analysis should be undertaken to evaluate the sensitivity of estimates to different models.
While Flowers did refer to some of the criticism of VAM, he did not mention the American Statistical Association. In response to questions I sent him about his piece, he said most of the ASA statements don’t apply to the Chetty et al. research, though some VAM critics disagree. Here is what Flowers wrote to me about his post:
The piece I wrote that was recently published by FiveThirtyEight was focused on a specific type of value-added model (VAM) — the one developed by Chetty, Friedman and Rockoff (CFR). In my reading of the literature on VAMs, including the American Statistical Association’s (ASA) statement, I felt it fair to characterize the CFR research as cutting-edge.
So, because the CFR research is so advanced, much of the ASA’s critique does not apply to it. In its statement, the ASA says VAMs “generally… not directly measure potential teacher contributions toward other student outcomes” (emphasis added). Well, this CFR work I profiled is the exception — it explicitly controls for student demographic variables (by using millions of IRS records linked to their parents). And, as I’ll explain below, the ASA statement’s point that VAMs are only capturing correlation, not causation, also does not apply to the CFR model (in my view). The ASA statement is still smart, though. I’m not dismissing it. I just thought — given how superb the CFR research was — that it wasn’t really directed at the paper I covered.
That said, I felt like the criticism of the CFR work by other academic economists, as well as the general caution of the ASA, warranted inclusion — and so I reached out to Jesse Rothstein, the most respected “anti-VAM” economist, for comment. I started and ended the piece with the perspective of “pro-VAM” voices because that was the peg of the story — this new exchange between CFR and Rothstein — and, if one reads both papers and talks to both sides, I thought it was clear how the debate tilted in the favor of CFR.
Now, why is that? I think there are two (one could argue three) empirical arguments at stake here. First, are the CFR results, based on NYC public schools, reproducible in other settings? If not — if other researchers can’t produce similar estimates with different data — then that calls it into question. Second, assuming the reproducibility bar is passed, can the CFR’s specification model withstand scrutiny; that is, is CFR’s claim to capture teacher value-added in isolation of all other factors (e.g., demographic characteristics, student sorting, etc.) really believable? This second argument is less about data than about statistical modeling.
What I found was that there was complete agreement (even by Rothstein) on this first empirical argument. CFR’s results are reproducible even by their critics, in different settings (Rothstein replicated in North Carolina). That’s amazing, right?
On the second argument, the two sides are still squabbling, as I make clear in the piece. But CFR’s latest rejoinder, which was the news peg of my article, sufficiently squashed two of Rothstein’s three criticisms….
Earlier I said one could argue there is a third empirical argument. There is, but it was beside the point of this specific Rothstein-vs-CFR exchange. It’s whether, assuming VAMs are indeed unbiased (as CFR does), then: so what? Isn’t this all just standardized testing? What do they really predict?
For those curious about this third empirical argument, I would refer anyone back to CFR’s second paper (American Economic Review, 2014b), where they impressively demonstrate how students taught by teachers with high VAM scores, all things equal, grow up to have higher earnings (through age 28), avoid teen pregnancy at greater rates, attend better colleges, etc. This is based on an administrative data set from the IRS — that’s millions of students, over 30 years. Of course, it all hinges on the first study’s validity (that VAM is unbiased) — which was the center of debate between Rothstein and CFR.
Whether VAM is unbiased — that’s what I profiled. But assuming CFR’s statistical model is correct, and the VAMs are accurate, then there are huge gains to using VAM as a tool. By the way, when the ASA (in its statement) says that VAMs are identifying correlation, not causation, THIS DOES NOT APPLY to CFR, assuming their model is correct. They have identified causal significance, because their experiment is quasi-random (if you believe them). I do.
Long story, short: the CFR research has withstood criticism from Rothstein (a brilliant economist, whom CFR greatly respects), and their findings were backed up by other economists in the field (yes, some of them do have a “pro-VAM” bias, but such is social science).
I have not read anything by the National Research Council. Besides, what is “high stakes” testing anyway? Is it testing that singularly determines teacher hiring/firing? If so, that’s not what the CFR researchers (as well as Kane, Staiger, etc.) are advocating; all of them agree that VAMs should not be used in isolation. That is, they all believe that VAMs are one piece of a bigger puzzle….
PS: If one really wants to poke holes in the CFR research, I’d look to its setting: New York City. What if NYC’s standardized tests are just better at capturing students’ long-run achievement? That’s possible. If it’s hard to do what NYC does elsewhere in the U.S., then CFR’s results may not apply.
I asked Audrey Amrein-Beardsley, a former middle- and high-school mathematics teacher who is now associate professor in Arizona State University’s Mary Lou Fulton Teachers College and a VAM researcher, about the FiveThirtyEight blog post and e-mail comments by Flowers. She earned a Ph.D. in 2002 from Arizona State University in the Division of Educational Leadership and Policy Studies with an emphasis on research methods. She had already written about Flowers’ blog post on her VAMBoozled! blog, which you can see here.
Here are her comments on what Flowers wrote to me in the e-mail. Some of them are technical, as any discussion about formulas would be:
Flowers: “The piece I wrote that was recently published by FiveThirtyEight was focused on a specific type of value-added model (VAM) — the one developed by Chetty, Friedman and Rockoff (CFR). In my reading of the literature on VAMs, including the American Statistical Association’s (ASA) statement, I felt it fair to characterize the CFR research as cutting-edge.”
Amrein-Beardsley: There is no such thing as a “cutting-edge” VAM. Just because Chetty had access to millions of data observations does not make his actual VAM more sophisticated than any of those in use otherwise or in other ways. The fact of the matter is that all states have essentially the same school-level data (i.e., very similar test scores by students over time, links to teachers, and series of typically dichotomous/binary variables meant to capture things like special education status, English language status, free-and-reduced lunch eligibility, etc.). These latter variables are the ones used, or not used depending on the model, for VAM-based analyses. While Chetty used these data and also had access to other demographic data (e.g., IRS data, correlated with other demographic data as well), and he could use these data to supplement the data from NYC schools, the data, whether dichotomous or continuous (which is a step in the right direction), still cannot and do not capture all of the things we know from the research that influence student learning, achievement, and more specifically growth in achievement in schools. These are the unquantifiable/uncontrollable variables that (will likely forever) continue to distort the measurement of teachers’ causal effects, and that cannot be captured using IRS data alone. For example, unless Chetty had data to capture teachers’ residual effects (from prior years), out-of-school learning, parental impacts on learning or a lack thereof, summer learning and decay, etc., it is virtually impossible, no matter how sophisticated any model or dataset is, to make such causal claims. Yes, such demographic variables are correlated with, for example, family income [but] they are not correlated to the extent that they can remove systematic error from the model.
Accordingly, Chetty’s model is no more sophisticated or “cutting-edge” than any other. There are probably, now, five-plus models being used today (i.e., the EVAAS, the Value-Added Research Center (VARC) model, the RAND Corporation model, the American Institutes for Research (AIR) model, and the Student Growth Percentiles (SGP) model). All of them except for the SGP have been developed by economists, and they are likely just as sophisticated in their design (1) given minor tweaks to model specifications and (2) given various data limitations and restrictions. In fact, the EVAAS, because it’s been around for over twenty years (in use in Tennessee since 1993, and in years of development prior), is probably considered the best and most sophisticated of all VAMs, and because it’s now run by the SAS analytics software corporation, I (and likely many other VAM researchers) would likely put our money down on that model any day over Chetty’s model, if both had access to the same dataset. Chetty might even agree with this assertion, although he would disagree with the EVAAS’s (typical) lack of use of controls for student background variables/demographics — a point of contention that has been debated, now, for years, with research evidence supporting both approaches; hence, the intense debates about VAM-based bias, now also going on for years.
Flowers: “So, because the CFR research is so advanced, much of the ASA’s [American Statistical Association’s] critique does not apply to it. In its statement, the ASA says VAMs “generally… not directly measure potential teacher contributions toward other student outcomes” (emphasis added). Well, this CFR work I profiled is the exception — it explicitly controls for student demographic variables (by using millions of IRS records linked to their parents). And, as I’ll explain below, the ASA statement’s point that VAMs are only capturing correlation, not causation, also does not apply to the CFR model (in my view). The ASA statement is still smart, though. I’m not dismissing it. I just thought — given how superb the CFR research was — that it wasn’t really directed at the paper I covered.”
Amrein-Beardsley: This is based on the false assumption, addressed above, that Chetty’s model is “so advanced” or “cutting edge,” or now as written here “superb.” When you appropriately remove or reject this assumption, ASA’s critique applies to Chetty’s model along with the rest of them. Should we not give credit to the ASA for taking into consideration all models when they wrote this statement, especially as they wrote their statement well after Chetty’s model had hit the public? Would the ASA not have written, somewhere, that their critique applies to all models “except for” the one used by Chetty et al. because they too agreed this one was exempt from their critiques? This singular statement is absurd in and of itself, as is the statement that Flowers isn’t “dismissing it.” I’m sure the ASA would be thrilled to hear. More specifically, the majority of models “explicitly control for student demographics” — Chetty’s model is by far not the only one (see the first response above, as again, this is one of the most contentious issues going). Given this, and the above, it is true that all “VAMs are only capturing correlation, not causation,” and all VAMs are doing this at a mediocre level of quality. The true challenge, should Chetty take it on, would be to put his model up against the other VAMs mentioned above, using the same NYC school-level dataset, and prove to the public that his model is so “cutting-edge” that it does not suffer from the serious issues with reliability, validity, bias, etc. with which all other modelers are contending. Perhaps Flowers’ main problem in this piece is that he conflated model sophistication with dataset quality, whereby the former is likely no better (or worse) than any of the others.
Lastly, for what “wasn’t really directed at the paper [Flowers] covered”…let’s talk about the 20-plus years of research we have on VAMs that Flowers dismissed, implicitly in that it was not written by economists, whereas Jesse Rothstein was positioned as the only respected critic of VAMs. My best estimate, and I’ll stick with it today, is that approximately 90 percent of all value-added researchers, including econometricians and statisticians alike, have grave concerns about these models, and consensus has been reached regarding many of their current issues. Only folks like Chetty and Kane (the two pro-VAM scholars), however, were positioned as leading thought and research in this area. Flowers, before he wrote such a piece, really should have done more homework. This also includes the other critiques of Chetty’s work, not mentioned whatsoever in this piece albeit very important to understanding it (see, for example, here, here, here, and here).
Flowers: “That said, I felt like the criticism of the CFR work by other academic economists, as well as the general caution of the ASA, warranted inclusion — and so I reached out to Jesse Rothstein, the most respected “anti-VAM” economist, for comment. I started and ended the piece with the perspective of “pro-VAM” voices because that was the peg of the story — this new exchange between CFR and Rothstein — and, if one reads both papers and talks to both sides, I though it was clear how the debate tilted in the favor of CFR.”
Amrein-Beardsley: Again, why only the critiques of other “academic economists,” or actually just one other academic economist to be specific (i.e., Jesse Rothstein, who most would agree is “the most respected ‘anti-VAM’ economist”)? Everybody knows Chetty and Kane (the other economist to whom Flowers “reached out”) are colleagues/buddies and very much on the same page and side of all of this, so Rothstein was really the only respected critic included to represent the other side. All of this is biased in and of itself (see also studies above for economists’ and statisticians’ other critiques), and quite frankly insulting to/marginalizing of the other well-respected scholars also conducting solid empirical research in this area (e.g., Henry Braun, Stephen Raudenbush, Jonathan Papay, Sean Corcoran). Nonetheless, this “new exchange” between Chetty and Rothstein is not “new” as claimed. It actually started back in October to be specific (see, here, for example). I too have read both papers and talked to both sides, and would hardly say it’s “clear how the debate” tilts either way. It’s educational research, and complicated, and not nearly as objective, hard, conclusive, or ultimately victorious as Flowers claims.
Flowers: “Now, why is that? I think there are two (one could argue three) empirical arguments at stake here. First, are the CFR results, based on NYC public schools, reproducible in other settings? If not — if other researchers can’t produce similar estimates with different data — then that calls it into question. Second, assuming the reproducibility bar is passed, can the CFR’s specification model withstand scrutiny; that is, is CFR’s claim to capture teacher value-added in isolation of all other factors (e.g., demographic characteristics, student sorting, etc.) really believable? This second argument is less about data than about statistical modeling…What I found was that there was complete agreement (even by Rothstein) on this first empirical argument. CFR’s results are reproducible even by their critics, in different settings (Rothstein replicated in North Carolina). That’s amazing, right?”
Amrein-Beardsley: These claims are actually quite interesting in that there is a growing set of research evidence that all models, using the same datasets, actually yield similar results. It’s really no surprise, and certainly not “amazing,” that Kane replicated Chetty’s results, or that Rothstein replicated them, more or less, as well. Even what some argue is the least sophisticated VAM (although some would cringe calling it a VAM) — the Student Growth Percentiles (SGP) model — has demonstrated itself, even without using student demographics in model specifications/controls, to yield similar output when the same datasets are used. One of my doctoral students, in fact, ran five different models using the same dataset and yielded inter/intra correlations that some could actually consider “amazing.” That is because, as at least some contend, these models are quite similar, and yield similar results given their similarities, and also their limitations. Some even go as far as calling all such models “garbage in, garbage out” systems, given the test data they all (typically) use to generate VAM-based estimates, and almost regardless of the extent to which model specifications differ. So replication, in this case, is certainly not the cat’s meow. One must also look to other traditional notions of educational measurement: reliability/consistency (which is not at high-enough levels, especially across teacher types), validity (which is not at high-enough levels, especially for high-stakes purposes), etc., in that “replicability” alone is more common than Flowers (and perhaps others) might assume. Just like it takes multiple measures to get at teachers’ effects, it takes multiple measures to assess model quality. Using replication, alone, is remiss.
Flowers: “For those curious about this third empirical argument, I would refer anyone back to CFR’s second paper in (American Economic Review 2014b), where they impressively demonstrate how students taught by teachers with high VAM scores, all things equal, grow up to have higher earnings (through age 28), avoid teen pregnancy at greater rates, attend better colleges, etc. This is based off an administrative data set from the IRS — that’s millions of students, over 30 years. Of course, it all hinges on the first study’s validity (that VAM is unbiased)— which was the center of debate between Rothstein and CFR.”
Amrein-Beardsley: The jury is definitely still out on this, across all studies…. Plenty of studies demonstrate (with solid evidence) that bias exists and plenty others demonstrate (with solid evidence) that it doesn’t.
Flowers: “Long story, short: the CFR research has withstood criticism from Rothstein (a brilliant economist, whom CFR greatly respects), and their findings were backed up by other economists in the field (yes, some of them do have a “pro-VAM” bias, but such is social science).”
Amrein-Beardsley: Long story, short: the CFR research has [not] withstood criticism from Rothstein (a brilliant economist, whom CFR [and many others] greatly respect), and their findings were backed up by other economists [i.e., two to be exact] in the field (yes, some of them [only Chetty’s buddy Kane] do have a “pro-VAM” bias, but such is social science). Such is the biased stance taken by Flowers in this piece, as well.
Flowers: “If one really wants to poke holes in the CFR research, I’d look to its setting: New York City. What if NYC’s standardized test are just better at capturing students’ long-run achievement? That’s possible. If it’s hard to do what NYC does elsewhere in the U.S., then CFR’s results may not apply.”
Amrein-Beardsley: First, plenty of respected researchers have already poked what I would consider as “enough” holes in the CFR research. Second, Flowers clearly does not know much about current standardized tests in that they are all constructed under contract with the same testing companies, they all include the same types of items, they all measure (more or less) the same set of standards… they all undergo the same sets of bias, discrimination, etc. analyses, and the like. As for their capacities to measure growth, they all suffer from a lack of horizontal, but more importantly, vertical equating; their growth outputs are all distorted because the tests (from pre to post) all capture one full year of growth; and they cannot isolate teachers’ residuals, summer growth/decay, etc., given that the pretests are not given the same year, within the same teacher’s classroom.
Flynt, the Florida teacher, posed this to his school board, a specific comment that speaks to the broader VAM issue:
Where is the value in the value-added model? How does all of this data and the enormous amount of time spent testing add value to me as a teacher, to students, to parents or to the community at large? It leads me to wonder what more can I possibly do, when the state issues predictions for my students that are impossible for them to meet, when I suffer financially because of my students’ test scores, what more can I do?