Rohit Prasad, vice president and head scientist of Alexa at Amazon, recently argued that the Turing test, long used to measure the sophistication of AI models, should be retired as a benchmark for AI.
Computer scientist and mathematician Alan Turing introduced the concept of the Turing test more than 70 years ago. The test was intended to help answer the question of machine intelligence: whether a machine is capable of “thought” in the human sense. Turing argued that if a machine could exhibit conversational behavior so sophisticated that a human observer couldn’t distinguish between the computer’s dialogue and a human’s dialogue, the machine ought to be considered capable of thought.
Turing Test Limitations
Prasad argued that the Turing test is limited in many ways and that Turing himself remarked on some of these limitations in his original paper. As AI has become more and more integrated into every facet of our lives, people care less about whether it is indistinguishable from a human and more about whether their interactions with it are seamless, Prasad argued. For this reason, he said, the Turing test should be considered obsolete and replaced with more useful benchmarks.
Prasad noted that many early chatbots were designed with passing the Turing test in mind, and in recent years some chatbots have consistently managed to fool more than a third of human judges, the threshold required to pass the test. However, successfully mimicking human speech patterns doesn’t mean that a machine can truly be considered “intelligent.” AI models can be extremely proficient in one area and extremely lacking in others, possessing no form of general intelligence. Despite this, the Turing test remains a commonly used benchmark for chatbots and digital assistants, with Prasad noting that business leaders and journalists constantly ask when Alexa will be capable of passing it.
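The "more than a third of judges" bar can be written as a one-line check. The following is only a minimal sketch of that arithmetic; the function name and parameters are hypothetical and do not come from Prasad or from any official test protocol.

```python
def clears_one_third_bar(judges_fooled: int, total_judges: int) -> bool:
    """Return True if strictly more than a third of the judges were fooled.

    Hypothetical helper for illustration; integer arithmetic avoids
    floating-point rounding at the exact one-third boundary.
    """
    if total_judges <= 0:
        raise ValueError("total_judges must be positive")
    if not 0 <= judges_fooled <= total_judges:
        raise ValueError("judges_fooled must be between 0 and total_judges")
    # judges_fooled / total_judges > 1/3  is equivalent to  3 * judges_fooled > total_judges
    return 3 * judges_fooled > total_judges
```

With 30 judges, for example, fooling 10 is exactly a third and does not clear the bar, while fooling 11 does.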
According to Prasad, one of the primary issues with using the Turing test to assess machine intelligence is that it almost entirely discounts the ability of machines to look up information and carry out lightning-fast computations. To seem human, AI programs insert artificial pauses before responding to complicated math and geography questions, even though they have the answers almost instantly. Beyond this, the Turing test relies only on text communication. It doesn’t account for AI’s increasing ability to use data gathered by outside sensors, ignoring how AIs can interact with the world around them through vision and motion algorithms.
Creating New Benchmarks
Prasad argued that new ways of measuring intelligence should be created, ones that are better suited to assessing a general type of intelligence. These tests should reflect how AI is actually used in modern society and what people hope to accomplish with it. They should be able to ascertain how well an AI augments human intelligence and how well it improves people’s daily lives. Further, a test should capture how well an AI manifests human-like features of intelligence, including language proficiency, self-supervision, and “common sense.”
Important current fields of AI research, like reasoning, fairness, conversation, and sensory understanding, aren’t evaluated by the Turing test, but they can be measured in a variety of ways. Prasad explained that one way of measuring these features of intelligence is to break challenges down into constituent tasks. Another is to create a large-scale, real-world challenge for human-computer interaction.
When Amazon created the Alexa Prize, it developed a rubric that required social bots to converse with a human for 20 minutes. The bots would be assessed on their ability to converse coherently on a wide variety of topics like technology, sports, politics, and entertainment. Customers were responsible for scoring the bots during the development phase, rating each bot based on how much they wanted to chat with it again. During the final round, independent judges graded the bots on a 5-point scale. The rubric used by the judges relied on methods that let AIs exhibit important human attributes, like empathy, where appropriate.
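A rubric of this kind can be pictured as a small scoring routine. The sketch below is illustrative only: the class, the field names, the 1-to-5 bounds, and the choice to average judge scores are all assumptions, since the article does not say how Amazon actually aggregated the grades.

```python
from dataclasses import dataclass
from statistics import mean

TARGET_MINUTES = 20          # conversation length named in the rubric
SCALE_MIN, SCALE_MAX = 1, 5  # assumed bounds of the 5-point scale


@dataclass
class JudgedConversation:
    """One final-round conversation (hypothetical structure)."""
    duration_minutes: float
    judge_scores: list[int]


def summarize(conversation: JudgedConversation) -> dict:
    """Average the judges' scores and check the 20-minute target.

    Averaging is an assumption made for this sketch, not Amazon's
    documented method.
    """
    if not conversation.judge_scores:
        raise ValueError("at least one judge score is required")
    for score in conversation.judge_scores:
        if not SCALE_MIN <= score <= SCALE_MAX:
            raise ValueError(f"score {score} is outside the 5-point scale")
    return {
        "mean_score": mean(conversation.judge_scores),
        "reached_target": conversation.duration_minutes >= TARGET_MINUTES,
    }
```

For instance, `summarize(JudgedConversation(21.5, [4, 5, 3]))` reports a mean score of 4 and marks the 20-minute target as reached.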
Ultimately, Prasad argued that the increasing proliferation of AI-powered devices like Alexa represents an important opportunity to measure the progress of AI, but we will need different metrics to take advantage of this new opportunity.
“Such AIs need to be an expert in a large, ever-increasing number of tasks, which is only possible with more generalized learning capability instead of task-specific intelligence,” Prasad explained. “Therefore, for the next decade and beyond, the utility of AI services, with their conversational and proactive assistance abilities on ambient devices, are a worthy test.”