As impressive as natural language processing systems have become in recent years, they are still vulnerable to a kind of exploit known as an “adversarial example”. Adversarial examples are carefully engineered phrases that can cause an NLP system to behave in unexpected and undesirable ways. Because these strange inputs can make AI programs misbehave, researchers are trying to design ways to protect against them.
Recently, a team of researchers from the University of Hong Kong and the Agency for Science, Technology, and Research in Singapore collaborated on an algorithm that demonstrates the danger of adversarial examples. As Wired reported, the research team dubbed the algorithm TextFooler; it works by subtly altering parts of a sentence in ways that change how an NLP classifier interprets it. In one example, the algorithm rewrote a sentence into a similar one, which was then fed into a classifier designed to determine whether a review was positive or negative. The original sentence was:
“The characters, cast in impossibly contrived situations, are totally estranged from reality.”
It was converted to this sentence:
“The characters, cast in impossibly engineered circumstances, are fully estranged from reality.”
These subtle changes prompted the text classifier to label the review as positive instead of negative. The research team tested the same approach (swapping certain words for synonyms) on several different datasets and text classification algorithms, and reports that it was able to drop an algorithm’s classification accuracy from 90% to just 10%, despite the fact that people reading the altered sentences would interpret them as having the same meaning.
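The core idea can be illustrated with a toy sketch. This is not the TextFooler implementation; the synonym table, the keyword-based sentiment classifier, and the greedy search are all invented here for illustration, but they show how swapping a few words for synonyms can flip a simple classifier’s label while preserving the sentence’s meaning for a human reader.

```python
# Toy sketch (NOT the TextFooler algorithm): a trivial keyword-based
# sentiment classifier and a greedy synonym-swap attack against it.
# The synonym table and keyword list are invented for illustration.

SYNONYMS = {
    "contrived": "engineered",
    "situations": "circumstances",
    "totally": "fully",
}

NEGATIVE_WORDS = {"contrived", "totally", "estranged"}

def classify(sentence: str) -> str:
    """Label a review 'negative' if it contains enough flagged words."""
    words = sentence.lower().replace(",", "").split()
    hits = sum(1 for w in words if w in NEGATIVE_WORDS)
    return "negative" if hits >= 2 else "positive"

def attack(sentence: str) -> str:
    """Greedily swap words for synonyms until the predicted label flips."""
    original = classify(sentence)
    words = sentence.split()
    for i, w in enumerate(words):
        core = w.rstrip(",.").lower()
        if core in SYNONYMS:
            # Replace the word, keeping any trailing punctuation.
            words[i] = SYNONYMS[core] + w[len(core):]
            if classify(" ".join(words)) != original:
                break
    return " ".join(words)

review = ("The characters, cast in impossibly contrived situations, "
          "are totally estranged from reality.")
print(classify(review))          # negative
print(classify(attack(review)))  # positive: the swaps flip the toy model
```

A real attack like TextFooler works against learned models rather than keyword lists, and picks substitutions that preserve semantics and grammaticality, but the greedy swap-and-recheck loop captures the basic shape of the approach.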
These results are concerning in an era where NLP algorithms and AI are used more and more frequently, including for important tasks like assessing medical claims or analyzing legal documents. It’s unknown just how much of a danger adversarial examples pose to algorithms currently in use, and research teams around the world are still trying to ascertain how much of an impact they can have. Recently, a report published by Stanford’s Human-Centered AI group suggested that adversarial examples could deceive AI algorithms and be used to perpetrate tax fraud.
There are some limitations to the recent study. Sameer Singh, an assistant professor of computer science at UC Irvine, notes that while the adversarial method was effective, it relies on some knowledge of the AI’s architecture: the AI has to be probed repeatedly until an effective group of words is found, and such repeated attacks might be noticed by security programs. Singh and colleagues have done their own research on the subject, finding that advanced systems like OpenAI’s algorithms can produce racist, harmful text when prompted with certain trigger phrases.
Adversarial examples are also a potential issue when dealing with visual data like photos or video. One famous example involves applying subtle digital transformations to an image of a kitten, prompting the image classifier to interpret it as a monitor or desktop PC. In another example, research by UC Berkeley professor Dawn Song found that adversarial examples can be used to change how road signs are perceived by computer vision systems, which could be dangerous for autonomous vehicles.
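The mechanism behind such image attacks can be sketched in miniature. The example below is not the kitten or road-sign attack itself; it uses an invented four-pixel “image” and a toy linear classifier to show how nudging every pixel by a small, bounded amount in the worst-case direction can flip a model’s predicted class even though the input barely changes.

```python
# Toy sketch of an adversarial image perturbation (weights and "image"
# are invented for illustration, not a real attack on a real model).

def sign(v: float) -> float:
    return 1.0 if v > 0 else -1.0

weights = [0.5, -0.3, 0.8, -0.1]   # toy linear model: score > 0 means "kitten"
image = [0.2, 0.1, 0.3, 0.4]       # toy 4-pixel "image"

def score(x):
    """Linear classification score: positive means class 'kitten'."""
    return sum(w * xi for w, xi in zip(weights, x))

# Fast-gradient-sign-style perturbation: push each pixel a small step
# (at most eps) in the direction that lowers the class score.
eps = 0.2
adversarial = [xi - eps * sign(w) for w, xi in zip(weights, image)]

print(score(image) > 0)        # True  -> classified as "kitten"
print(score(adversarial) > 0)  # False -> the barely-changed image flips class
```

For a linear model, each eps-sized nudge reduces the score by eps times the weight’s magnitude, so many tiny per-pixel changes add up to a large change in the output; real attacks on deep networks exploit the same effect via the model’s gradients.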
Research like the kind done by the Hong Kong-Singapore team could help AI engineers better understand what kinds of vulnerabilities AI algorithms have, and potentially design ways to safeguard against them. As an example, ensemble classifiers can be used to reduce the chance that an adversarial example will deceive a computer vision system. With this technique, a number of classifiers are used and slight transformations are made to the input image. The majority of the classifiers will typically discern aspects of the image’s true content, and their outputs are then aggregated. The result is that even if a few of the classifiers are fooled, most of them won’t be, and the image will be properly classified.
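The voting idea above can be sketched as follows. Everything here is invented for illustration: the three stand-in “classifiers”, the brightness-shift “transformations”, and the pixel values are placeholders for real models, real image augmentations, and real images, but the majority-vote structure is the point.

```python
# Toy sketch of an ensemble defense: each simple classifier sees a
# slightly transformed copy of the input, and the final label is the
# majority vote. All classifiers and transforms are invented stand-ins.
from collections import Counter

def classifier_a(img):   # votes on mean brightness
    return "cat" if sum(img) / len(img) > 0.5 else "dog"

def classifier_b(img):   # votes on the brightest pixel
    return "cat" if max(img) > 0.8 else "dog"

def classifier_c(img):   # votes on the first pixel only (easy to fool)
    return "cat" if img[0] > 0.5 else "dog"

# Fixed, slight input transformations (stand-ins for crops/shifts/noise).
def shift_up(img):
    return [min(1.0, p + 0.05) for p in img]

def shift_down(img):
    return [max(0.0, p - 0.05) for p in img]

def ensemble_classify(img):
    """Aggregate the classifiers' votes on transformed inputs."""
    votes = [
        classifier_a(shift_up(img)),
        classifier_b(img),
        classifier_c(shift_down(img)),
    ]
    return Counter(votes).most_common(1)[0][0]

# An input crafted to fool classifier_c (a low first pixel) is still
# labeled correctly, because the other two classifiers outvote it.
image = [0.1, 0.9, 0.85, 0.7]
print(ensemble_classify(image))  # cat
```

Because an adversarial perturbation tuned against one classifier rarely transfers to all of them at once, especially when each sees a differently transformed input, the majority vote tends to recover the true label.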