In 1950, the ingenious computer scientist Alan Turing proposed a thought experiment he called the Imitation Game. An interviewer converses via typewriter with two subjects, knowing one is human and the other a machine. If a machine could consistently fool the interviewer into believing it was the human, Turing suggested, we might speak of it as capable of something like thinking.
Whether machines could actually think, Turing believed, was a question “too meaningless to deserve discussion.” Nonetheless, the “Turing test” became a benchmark for machine intelligence. Over the decades, various computer programs vied to pass it using cheap conversational tricks, with some success.
In recent years, wealthy tech firms including Google, Facebook and OpenAI have developed a new class of computer programs known as “large language models,” with conversational capabilities far beyond the rudimentary chatbots of yore. One of those models — Google’s LaMDA — has convinced Google engineer Blake Lemoine that it is not only intelligent but conscious and sentient.
If Lemoine was taken in by LaMDA’s lifelike responses, it seems plausible that many other people with far less understanding of artificial intelligence (AI) could be as well, which speaks to its potential as a tool of deception and manipulation in the wrong hands.
To many in the field, then, LaMDA’s remarkable aptitude at Turing’s Imitation Game is not an achievement to be celebrated. If anything, it shows that the venerable test has outlived its use as a lodestar for artificial intelligence.
“These tests aren’t really getting at intelligence,” said Gary Marcus, a cognitive scientist and co-author of the book “Rebooting AI.” What they’re getting at is the capacity of a given software program to pass as human, at least under certain conditions. Which, come to think of it, might not be such a good thing for society.
“I don’t think it’s an advance toward intelligence,” Marcus said of programs like LaMDA generating humanlike prose or conversation. “It’s an advance toward fooling people that you have intelligence.”
Lemoine may be an outlier among his peers in the industry. Both Google and outside experts on AI say that the program does not, and could not possibly, possess anything like the inner life he imagines. We don’t need to worry about LaMDA turning into Skynet, the malevolent machine mind from the Terminator movies, anytime soon.
But there is cause for a different set of worries, now that we live in the world Turing predicted: one in which computer programs are advanced enough that they can seem to people to possess agency of their own, even if they actually don’t.
Cutting-edge artificial intelligence programs, such as OpenAI’s GPT-3 text generator and DALL-E 2 image generator, are focused on generating uncannily humanlike creations by drawing on immense data sets and vast computing power. They represent a far more powerful, sophisticated approach to software development than was possible when programmers in the 1960s gave a chatbot called ELIZA canned responses to various verbal cues in a bid to hoodwink human interlocutors. And they may have commercial applications in everyday tools, such as search engines, autocomplete suggestions, and voice assistants like Apple’s Siri and Amazon’s Alexa.
It’s also worth noting that the AI sector has largely moved on from using the Turing test as an explicit benchmark. The designers of large language models now aim for high scores on tests such as the General Language Understanding Evaluation, or GLUE, and the Stanford Question Answering Dataset, or SQuAD. And unlike ELIZA, LaMDA wasn’t built with the specific intention of passing as human; it’s just very good at stitching together and spitting out plausible-sounding responses to all kinds of questions.
Yet beneath that sophistication, today’s models and tests share with the Turing test the underlying goal of producing outputs that are as humanlike as possible. That “arms race,” as the AI ethicist Margaret Mitchell called it in a Twitter Spaces conversation with Washington Post reporters on Wednesday, has come at the expense of all sorts of other possible goals for language models. Those include ensuring that their workings are understandable and that they don’t mislead people or inadvertently reinforce harmful biases. Mitchell and her former colleague Timnit Gebru were fired by Google in 2021 and 2020, respectively, after they co-authored a paper highlighting those and other risks of large language models.
While Google has distanced itself from Lemoine’s claims, it and other industry leaders have at other times celebrated their systems’ ability to trick people, as Jeremy Kahn pointed out this week in his Fortune newsletter, “Eye on A.I.” At a public event in 2018, for instance, the company proudly played recordings of a voice assistant called Duplex, complete with verbal tics like “umm” and “mm-hm,” that fooled receptionists into thinking it was a human when it called to book appointments. (After a backlash, Google promised the system would identify itself as automated.)
“The Turing Test’s most troubling legacy is an ethical one: The test is fundamentally about deception,” Kahn wrote. “And here the test’s impact on the field has been very real and disturbing.”
Kahn reiterated a call, often voiced by AI critics and commentators, to retire the Turing test and move on. Of course, the industry already has, in the sense that it has replaced the Imitation Game with more scientific benchmarks.
But the Lemoine story suggests that perhaps the Turing test could serve a different purpose in an era when machines are increasingly adept at sounding human. Rather than being an aspirational standard, the Turing test should serve as an ethical red flag: Any system capable of passing it carries the danger of deceiving people.