A veteran teacher suing New York state education officials over the controversial method they used to evaluate her as “ineffective” is expected to go to New York Supreme Court in Albany this week for oral arguments in a case that could affect all public school teachers in the state and even beyond.

Sheri G. Lederman, a fourth-grade teacher in New York’s Great Neck public school district, is “highly regarded as an educator,” according to her district superintendent, Thomas Dolan, and has a “flawless record.” The standardized math and English Language Arts test scores of her students are consistently higher than the state average.

Yet her 2013-2014 evaluation, based in part on student standardized test scores, rated her as “ineffective.” How can a teacher known for excellence be rated “ineffective”? It happens — and not just in New York.

The evaluation method, known as value-added measurement (or modeling), purports to be able to predict through a complicated computer model how students with similar characteristics are supposed to perform on the exams — and how much growth they are supposed to show over time — and then rate teachers on how well their students measure up to the theoretical students. New York is just one of the many states where VAM is one of the chief components used to evaluate teachers.

Testing experts have for years been warning school reformers that efforts to evaluate teachers using VAM are not reliable or valid, but school reformers, including Education Secretary Arne Duncan and New York Gov. Andrew Cuomo, both Democrats, have embraced the method as a “data-driven” evaluation solution championed by some economists.

Lederman’s suit against state education officials — including John King, the former state education commissioner, who now is a top adviser to Duncan at the Education Department — challenges the rationality of the VAM model used to evaluate her and, by extension, other teachers in the state. The lawsuit alleges that the New York State Growth Measures “actually punishes excellence in education through a statistical black box which no rational educator or fact finder could see as fair, accurate or reliable.”

It also, in many aspects, defies comprehension. High-stakes tests are given only in math and English language arts, so reformers have decided that all teachers (and, sometimes, principals) in a school should be evaluated by reading and math scores. Sometimes, school test averages are factored into all teachers’ evaluations. Sometimes, a certain group of teachers are attached to either reading or math scores; social studies teachers, for example, are more often attached to English Language Arts scores, while science teachers are attached to math scores. An art teacher in New York City explained in this post how he was evaluated on math standardized test scores and saw his evaluation rating drop from “effective” to “developing.”

A teacher in Florida — which is another state that uses VAM — discovered that his top-scoring students actually hurt his evaluation. How? In Indian River County, an English Language Arts middle school teacher named Luke Flynt told his school board that through VAM formulas, each student is assigned a “predicted” score — based on past performance by that student and other students — on the state-mandated standardized test. If the student exceeds the predicted score, the teacher is credited with “adding value.” If the student does not do as well as the predicted score, the teacher is held responsible and that score counts negatively toward his/her evaluation. He said he had four students whose predicted scores were “literally impossible” because they were higher than the maximum number of points that can be earned on the exam. He said:

“One of my sixth-grade students had a predicted score of 286.34. However, the highest a sixth-grade student can earn is 283. The student did earn a 283, incidentally. Despite the fact that she earned a perfect score, she counted negatively toward my valuation because she was 3 points below predicted.”

Hard to believe, isn’t it?
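The arithmetic behind Flynt's example can be sketched in a few lines of Python. This is only an illustration of the idea he described (credit or blame based on actual score minus predicted score); the real Florida formula is not given here, and the function names are hypothetical.

```python
# Illustrative sketch only: the actual VAM formula is not disclosed in this article.
# The two numbers below come from Flynt's statement to his school board.
MAX_SCORE = 283.0        # highest score a sixth-grader can earn on the exam
PREDICTED_SCORE = 286.34  # score the model predicted for one of his students


def value_added(actual: float, predicted: float) -> float:
    """Return actual minus predicted; a negative result counts against the teacher."""
    return actual - predicted


def best_possible_value_added(predicted: float, max_score: float = MAX_SCORE) -> float:
    """Return the value added if the student earns a perfect score."""
    return value_added(max_score, predicted)


# Even a perfect score leaves the student below the prediction.
print(round(best_possible_value_added(PREDICTED_SCORE), 2))  # -3.34
```

Whenever the predicted score is higher than the maximum possible score, the best achievable result is negative, which is why Flynt called such predictions “literally impossible” to meet.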

In 2012-13, 68.75 percent of Lederman’s New York students met or exceeded state standards in both English and math. She was labeled “effective” that year. In 2013-2014, her students’ test results were very similar, but she was rated “ineffective.” Dolan, the superintendent, said in an affidavit:

As superintendent of the GNPS, I have personally known Dr. Lederman for approximately 4 years. I have had the opportunity to meet with her personally. I have also reviewed her record of teaching, particularly the performance of her students on New York State assessment tests. I can personally attest that she is highly regarded as an educator by the administration of GNPS. Her classroom observations have consistently identified her as an exceptional educator. She is widely regarded in the GNPS as someone who brings out the best in her students. She has taught for seventeen (17) years in the GNPS and her record is flawless.

Affidavits of numerous experts supporting Lederman have been filed — including from Stanford University professor Linda Darling-Hammond — and you can see them here. Oral arguments are scheduled to be heard Wednesday, Aug. 12. Should Lederman successfully challenge the New York teacher evaluation system, state officials might have to revamp it.

A legal memo from Lederman’s attorney, Bruce Lederman (who is also her husband), which responds to filings from the state, describes the situation this way:

Stripped of rhetoric, Respondents’ explanation is that a complex computer program — the operation of which is not transparent as required by New York State Education Law § 3012-c(2)(j)(1) — which purportedly takes into account the effects of poverty, English language fluency, and learning disability in crude and undisclosed ways, predicted that Petitioner’s 4th grade students would score better than they did. Based upon the alleged “failure” of Petitioner’s students to meet computer predictions that generate a so-called student growth percentile (“SGP”) for each of her fifteen (15) students, the algorithm somehow adjusted the SGPs and averaged the adjusted SGPs to create a so-called mean growth percentile (“MGP”) which Respondents’ computer program in turn used to determine that she was an “ineffective educator” in terms of promoting growth. As stated in the Petition, this process is a statistical black box which no rational fact finder could see as fair, accurate or reliable.

It also discusses affidavits of experts in teacher evaluation and/or measurement, which I am including here because they show the very serious problems with the New York teacher evaluation system:

The original Petition was accompanied by expert affidavits from Professor Linda Darling-Hammond (Stanford University), Professor Aaron Pallas (Columbia University Teachers College), Professor Audrey Amrein-Beardsley (Arizona State University), Carol Burris, (Long Island principal who has been recently recognized as both the Educator of the Year and the Principal of the Year), and Brad Lindell (Long Island research consultant and school psychologist who has conducted a detailed study of the Respondents’ teacher evaluation system). These individuals are nationally and locally renowned and respected in their respective fields, and are unaffiliated with Petitioner.
The original affidavit of Professor Linda Darling-Hammond of Stanford University, sworn to February 28, 2015, explained that the assessment being used in Respondents’ Growth Model does not allow measurement of growth for high-achieving and low-achieving students: the learning of both high-achieving and low-achieving students is mis-measured because of the fact that the state tests pegged to grade-level standards do not include items that can measure growth for students who are already above grade level in their skills or who fall considerably below. This is not a problem that can be corrected by statistical adjustments.
(Original Darling-Hammond Aff. at ¶22.) Professor Darling-Hammond expressed a clear and specific opinion that the volatility in Petitioner’s ratings between school year 2012/13 and 2013/14 rendered the results irrational: The unexplained swing of Petitioner’s rating from 2012-2013 when she was identified as an “effective” teacher with a rating of 14 out of 20, to 2013-2014, when she was identified as an “ineffective” teacher with a rating of 1 out of 20, is clear proof the model and rating system, as applied to Petitioner, are irrational. These large swings demonstrate that, whatever it is the model is measuring, it is not measuring a stable construct we would recognize as reflecting teacher effectiveness, which should not and does not, in fact, change dramatically from year to year.
(Id. at ¶25.) Professor Darling-Hammond further expressed an expert opinion that the lack of any review process by Respondents was itself irrational. (Id. at ¶30.) Given the widely known challenges to accurate assessment of teachers using value-added methods, it is my further belief that the absence of any way for a particular teacher to challenge a VAM score on an individualized basis is irrational. For all the reasons outlined above, it is well-known and accepted among researchers that a particular individual score produced by a VAM procedure, even on the best developed and administrated model, may be wrong for a variety of reasons. It is my opinion that any teacher receiving an adverse score on a VAM model therefore should have the right to understand why he or she got the score awarded and appeal that score based upon the particular facts of any given case.
The original affidavit of Professor Aaron Pallas of Columbia University, sworn to February 25, 2015, explained, among other things, that the Respondents’ Growth Model was flawed because it pre-determined that only 7% of teachers could ever be rated as highly effective, and mandated that 7% of teachers would be rated as ineffective (the top and bottom ratings, respectively) without any scientific definition of effectiveness. (Original Pallas Aff. at ¶¶13, 14.) Professor Pallas also explained that the model did not provide a teacher, or his/her supervisor, with information at the beginning of the school year about what level of student performance would result in the teacher being rated Highly Effective, Effective, Developing or Ineffective. Professor Pallas explained that, in New York, nobody knows before the end of the school year what was required of a teacher during the school year in order for that teacher to obtain a particular rating for that particular school year. (Id. at ¶¶20-22.)
The original affidavit of Professor Audrey Amrein-Beardsley, of Arizona State University, sworn to February 28, 2015, explained that VAM models, such as that being used by Respondent, have a 50% chance of showing that a teacher caused growth in students one year, and a 50% chance of showing that the teacher detracted from student growth, so “the probability that this teacher was truly effective or ineffective was no different than the flip of a coin.” (Amrein-Beardsley Aff. at ¶9.)
Dr. Carol Burris, a New York State Educator of the Year, and fellow of the National Education Policy Center, in her affidavit sworn to February 27, 2015, explained, among other things, that something is “seriously wrong if teachers, many of whom teach the same grade level, in the same school, using the same curriculum, could have their scores dramatically shift each year.”
Dr. Brad Lindell, a school psychologist with an advanced degree in applied research, in his affidavit sworn to March 1, 2015, provided the following example of why evaluations need to be reliable to be considered valid:
To understand the importance of reliability, I asked the Court to please consider the following. One of the most well-known measures of intelligence is the Wechsler Intelligence Scale for Children –V (WISC-V). One of the scores the WISC-V provides is a full scale IQ score that ranges from 40 to 160 with an IQ score of 100 being the average score. The reliability of this score with a one-year interval between two administrations of the test is generally in the .80 to .90 range. This indicates strong what is called test-retest reliability and the stability of the test over time. This means that if a student scored a 100 one year, it would be expected that when he is administered the test a year later, he would receive a similar score. If he received a score of 100 (average score) and then received a score of 120 (superior score) a year later and then a score of 85 (low average score) the following year, the WISC-V would then not be considered reliable. If it was not reliable, then it could not be a valid measure of intelligence. The same argument is relevant to teacher assigned VAM scores. In order for the VAM scores to be a valid measure of teacher effectiveness, it needs to be reliable.
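For readers who want to see what a test-retest reliability figure like “.80 to .90” means in practice, here is a minimal Python sketch. It assumes reliability is computed as a Pearson correlation between two administrations of the same measure one year apart, which is one common way to estimate it. All scores below are invented for illustration and are not taken from the affidavits or from any real data.

```python
import math


def pearson(first: list[float], second: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(first)
    mean_a = sum(first) / n
    mean_b = sum(second) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical scores for five people measured in two consecutive years.
year_one = [100, 110, 90, 120, 85]
stable_year_two = [102, 108, 93, 118, 88]     # each score lands near last year's
unstable_year_two = [120, 95, 105, 88, 110]   # scores swing widely

# A reliable measure produces a correlation close to 1.
print(pearson(year_one, stable_year_two))
# An unreliable measure produces a correlation far from 1.
print(pearson(year_one, unstable_year_two))
```

On Lindell's argument, a teacher rating that swings from year to year the way the second set of scores does would fail this kind of check, and a measure that is not reliable cannot be valid.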