Can machines really grade essays as well as humans? Naturally, there is a study that says they can (there are studies saying just about everything), but a number of experts say otherwise. Here’s one of them, writing about why the machines are inferior graders and should not be used. It was written by Maja Wilson, the author of “Rethinking Rubrics in Writing Assessment” and the coauthor, with Richard Haswell, of “Professionals Against Machine Scoring Of Student Essays In High-Stakes Assessment” (humanreaders.org). She taught adult basic education, alternative education, and high school English in Michigan’s public schools for ten years. This fall, she will join the teacher education faculty at the University of Maine, Farmington.
By Maja Wilson
On June 2, 2014, the Department of Homeland Security released a work tender that led to a frenzy of sarcastic tweets: the U.S. Secret Service is looking for a social media software analytics tool with “the ability to detect sarcasm…”
Why? While current analytics can identify trigger words in a tweet, they cannot understand what the tweeter meant, especially when sarcasm is involved. This leads to “false positives,” a stunning bit of understatement: what the Secret Service calls a false positive, a citizen experiences as an invasion of privacy, or worse.
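To see how blunt trigger-word matching is, consider a minimal sketch in Python. The word list and tweets are invented for illustration; this is not the Secret Service's actual tool, just the kind of context-free matching described above.

```python
# Hypothetical trigger words for illustration only.
TRIGGER_WORDS = {"bomb", "attack", "threat"}

def flags_tweet(tweet: str) -> bool:
    """Flag a tweet if it contains any trigger word, ignoring all context."""
    words = {w.strip(".,!?").lower() for w in tweet.split()}
    return bool(words & TRIGGER_WORDS)

# A harmless, colloquial tweet trips the same wire as a genuine threat:
print(flags_tweet("Great, another delay. This airline is the bomb."))  # True: a false positive
print(flags_tweet("I will attack this pile of laundry tonight."))      # True: another false positive
```

Because the matcher sees only isolated words, sincere praise, idiom, and sarcasm all look identical to it.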
Examining the Secret Service’s trouble with sarcastic tweets shines a light on how we use language to make meaning – and points to why we shouldn’t use language processing programs (automated essay scoring) in high-stakes writing tests.
User-interface developers who understand a thing or two about psychology may have lulled us into thinking that automated language programs understand us. The automated bank teller’s breathless yet maternal voice lays out all the things she can do for us, cheerfully announces that she’s sorry she “didn’t get that,” and even sounds as if she’s typing industriously on our behalf while our account is being accessed. But, as we know all too well the moment we get caught in an endless automated loop, she does not “get” us and cannot do anything for us. She’s not Scarlett Johansson – and she’s not even a she.
I’m perfectly okay with my automated teller’s limitations when I just want information about my balance or recent withdrawals. But it’s important to understand why automated programs can’t understand what we mean so that we can prevent and protest their use when meaning matters most.
A little-publicized phrase in the work tender gives us our first glimpse into why sarcasm – and meaning itself – eludes today’s automated language programs: in addition to detecting sarcasm, the tool should be able to perform “sentiment analysis.” Sentiment. It’s a word we hardly use anymore, and don’t expect to see associated with cutting-edge technology, but it’s at the crux of what makes human communication so gloriously complex. Our ability to detect sarcasm – and to understand language at all – depends on our ability to understand (and have) human sentiments, or feelings. And I’m not just talking about understanding poetry or a tear-jerker of a novel.
Meaning doesn’t just reside, it turns out, in word definitions and syntactic structures. If it did, the collective IQ and resources of Silicon Valley would already have figured out how to program a sarcasm detector. But meaning also resides in layers of context: in the communicative context that gives the speaker a reason to speak to an audience in the first place; the cultural context that gives allusions or even phrases common resonance to certain groups; and the semantic context created as one sentence builds on another. And, in an irony that escapes whoever believes that an automated sarcasm detector is actually within our reach, making meaning from language depends on a skill currently only available to human beings: the ability to project our subjectivity – our emotions and experiences – onto words.
Sarcasm relies on a layer of feeling that runs directly opposite to the speaker’s words. It takes years for a human being to learn to understand sarcasm. We learn it first in verbal form, when the speaker’s tone of voice or information from the environment doesn’t match what’s being said: Mom just said, “It’s another beautiful day!” but she sounds irritated and I can see that it is still raining outside.
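The "sentiment analysis" the Secret Service asked for typically works by tallying words from positive and negative lists. A toy sketch, with word lists invented for illustration (real sentiment lexicons are far larger), shows why the approach is blind to the mismatch that makes Mom's remark sarcastic:

```python
# Hypothetical sentiment lexicons for illustration only.
POSITIVE = {"beautiful", "great", "wonderful"}
NEGATIVE = {"awful", "terrible", "miserable"}

def sentiment_score(text: str) -> int:
    """Positive minus negative word count; no access to tone or weather."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Mom's sarcastic remark scores as cheerful, because the program cannot
# see the rain or hear her irritation:
print(sentiment_score("It's another beautiful day!"))  # 1 (positive)
```

The score reflects only the words on the page; the contradicting feeling that a three-year-old eventually learns to hear is invisible to it.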
Sarcasm is even more difficult to detect in written form, because the feeling contradicting the words is often invisible. Readers can sometimes infer sarcasm when they spot a textual contradiction – when one statement directly undercuts another, for instance. The reader must conclude that the writer is either being inconsistent, untruthful, controlled by someone else, or ironic. If the irony is particularly cutting or contemptuous, it qualifies as sarcasm.
But when we notice such a contradiction, how do we decide whether it’s sarcasm or one of the other options? We have to draw on our emotional intelligence.
Emotional intelligence – the ability to monitor our own and other people’s emotions – starts with awareness of our own feelings, and is followed by a life-long tension between projection and differentiation. That tension looks like this: you understand other people by projecting your own experiences and feelings onto them at the same time that you recognize that they are not you.
A simple example illustrates the necessity of keeping both paradoxical processes in play. When I was 3 years old, I shared a room briefly with my 5-year-old brother. We each slept with a plastic mug of water nearby. One night, as I responded to my own thirst with a satisfying draught from my mug, I became convinced that my brother must also be thirsty. Certain he would be grateful for my act of kindness, I poured water all over his sleeping face. I was astonished when he screamed bloody murder. I learned a powerful lesson about separate bodies and separate needs that evening. Still, my misguided impulse was a foundation for empathy.
Understanding what we read involves a similar tension. As we read, we have to project our own experiences and feelings onto the words to understand them at the same time that we must recognize that the author is not us, searching for clues about how the words represent experiences that are different from ours. It’s why great works of literature are valuable not just as repositories of cultural knowledge and values, but for the part they play in our development as social and emotionally intelligent human beings. It isn’t just that the characters we meet in the pages of books help us with reflection and perspective-taking. It’s that the act of reading anything at all requires and develops these skills on some basic level.
Contrast what’s involved in making meaning from text – emotional intelligence, understanding of context, projection and differentiation – with how automated systems work. Many specific programs protect their workings as proprietary secrets, but there are basic limitations to what even the best of these programs can do. Here’s an explanation for laypeople from Harvard linguist and psychologist Steven Pinker:
Early on in the history of artificial intelligence, computer scientists tried to see how much they could program of the human ability to deeply understand something…Because this turned out to be way too hard for the computer scientists of the 70’s and 80’s, they discovered a kind of kluge, a work-around. If you look at large bodies of text, and you soak up correlations – what word appears with what word and what other word – then you can get some part of the way…Unfortunately, in the current state of the art, that’s the way that computer language systems work. AI researchers gave up on the idea of actually understanding text. (Interview, November 2013)
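The correlation "kluge" Pinker describes can be illustrated in a few lines. The sketch below simply counts which words appear together in the same sentence; the corpus is invented, and no vendor's actual scoring algorithm is being reproduced here:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Count how often each unordered word pair appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts

corpus = [
    "it is another beautiful day",
    "what a beautiful day outside",
    "it is still raining outside",
]
counts = cooccurrence_counts(corpus)
print(counts[("beautiful", "day")])  # 2: the two words travel together
```

The statistics capture that "beautiful" and "day" travel together; they say nothing about whether "beautiful day" was ever meant sincerely. That is the sense in which such systems soak up correlations rather than understand text.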
It’s why the automated bank teller doesn’t actually understand your needs and why the Secret Service doesn’t think your last tweet was funny.
It’s also why Automated Essay Scoring (AES) shouldn’t be used in high-stakes writing tests. In late March, the Partnership for Assessment of Readiness for College and Careers (PARCC) gave a trial writing test to a million public school students. The essays written by these school children will be used to “train” the automated essay-scoring tool that they intend to roll out for their test of the Common Core State Standards (CCSS) in 2015. In other words, programs that cannot read or understand text, sarcastic or not, will be used by the school reform movement to make high-stakes decisions about the futures of millions of students, teachers, and administrators.
PARCC isn’t the first high-stakes assessment program to rely on AES. In 2004, Indiana used AES in its high-stakes assessment program. Accuplacer (marketed by the College Board) has been used by colleges and universities for years to place students into non-credit-bearing remedial writing courses. But PARCC’s use of AES is notable for its reach – currently, 14 states are planning to use PARCC to assess CCSS; and SmarterBalanced, which also intends to use AES in its CCSS tests, isn’t far behind.
Even if AES programs become good enough to match human scores on complex writing tasks, there are profound reasons not to use AES. First of all, its use puts a kind of reverse pressure on human scorers, who often work under conditions that don’t allow them to read with any kind of care or attention anyway. For an appalling and entertaining description of these conditions, read Todd Farley’s account of working for testing companies in his bestselling exposé, “Making the Grades: My Misadventures in the Standardized Testing Industry.” When you give human readers two minutes to scan each essay, they are forced to look at the most superficial aspects of a student’s writing, and their scores will be more likely to match those generated by a computer. In a similar perversion, you can get humans to agree with computers (and humans to agree with other humans) more often when you severely constrain the topic and the genre in which students are allowed to show how well they can write.
It’s like what happens when you’re delighted to reach a human customer service representative only to realize that he’s reading off a script. My friend once very calmly clarified a point that he felt a customer service representative had misunderstood. The representative replied in a sing-song voice, “I can hear that you’re angry.” My friend hadn’t been angry at all. But the human representatives had been trained like computers.
Beyond the perversion of scoring, writing, and reading that takes place when AES competes with human readers, the use of AES in high-stakes testing undermines the teaching of writing in the most fundamental way possible. By teaching students to write, we are teaching them how to use language to convey their meaning. By evaluating their writing with a tool that isn’t even designed to understand meaning, we send students the opposite message, stunting their long-term development as writers: actually, meaning doesn’t matter.
AES in high-stakes testing affects more people than the Secret Service’s social media analytics. Over 4,000 writing scholars and professionals have already signed an evidence-based call to end the use of AES in high-stakes writing tests (humanreaders.org). Then again, what does the fate of our children matter compared to a promising source of profits for the corporations selling automated scoring systems?
That was sarcasm.
Correction: Accuplacer is marketed by the College Board, not ETS.