
Teaching AI to Understand and Use Images in Dialogue

Researchers from South Korea have developed a dataset designed to aid research into AI's understanding of the way that humans use images in dialogue, and to help natural language models participate in this relatively recent development in human communication.

The paper, from KAIST at Daedeok Innopolis, notes that research into such multi-modal dialogue systems over the last ten years has been hamstrung by datasets and methodologies centering on disciplines that are peripheral to the topic, such as visual question answering and image captioning.

In these older approaches, images are evaluated outside the lexical context of a conversation, with no understanding of the way that dialogue is enhanced and developed by image responses, and no cross-domain schema for decoding the contribution that images make to discourse.

Images as First-Class Facets of Dialogue

Many of the aforementioned approaches to date have been initiatives or developments from Microsoft's AI research arm, which in 2017 also examined the topic of multimodal conversations that begin with an image, rather than conversations that freely use images as dialogue components.

To address the shortfall in research data, the South Korean researchers have developed a dataset of 45,000 dialogue instances involving the ad hoc use of images, without concentrating on viral 'meme' images; the latter, though an area of interest in language research, are arguably less of a challenge, because the meaning of viral memes can be inferred more easily through thousands of in-context uses on social media platforms.

Developing Illustrations as a Substitute for Text

In order to develop a methodology for translating between words or phrases and images, the South Korean researchers trained a machine learning system to replace parts of a text-based conversation with semantically relevant image content.

Architecture of the Korean system for generating a dataset for multimodal dialogue research. Source: https://arxiv.org/pdf/2107.08685.pdf

Pre-processing the target phrases involved the deletion of stop words that might inhibit prediction of the next turn in the conversation, and the pruning of inferior-quality exchanges via contextual similarity filters.
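
The paper does not publish its pre-processing code, but the two steps described above can be pictured with a minimal sketch along the following lines; the bag-of-words cosine similarity used here is an assumption standing in for whatever contextual similarity measure the researchers actually employed.

```python
# Minimal illustrative sketch of the two pre-processing steps described above:
# stop-word removal on each utterance, and pruning of exchanges whose reply is
# only weakly related to its context. A simple bag-of-words cosine similarity
# stands in here for the contextual similarity filter; the actual pipeline would
# rely on a learned sentence encoder.

from collections import Counter
import math

STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "i", "you", "it", "what"}

def content_words(utterance: str) -> list[str]:
    """Lower-case, strip punctuation and drop stop words."""
    return [w.strip(".,!?") for w in utterance.lower().split()
            if w.strip(".,!?") not in STOP_WORDS]

def cosine_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Bag-of-words cosine similarity between two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def keep_exchange(context: str, reply: str, threshold: float = 0.15) -> bool:
    """True if the reply is coherent enough with its context to keep."""
    return cosine_similarity(content_words(context), content_words(reply)) >= threshold

print(keep_exchange("I just adopted a dog from the shelter",
                    "What breed is the dog? I love dogs."))  # True
```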

To test the utility of the dataset, the researchers set a module to predict the next ‘turn' in the dialogue while considering the context of the conversation and the images involved.

The human evaluation GUI used in the research.

Five external datasets were used as base material for the 45k dataset (which is available on GitHub). Three are text-based: DailyDialog, a manually-annotated multi-turn set from 2017; and Facebook's EmpatheticDialogues and PersonaChat, both from 2018. The two image-based datasets used were MS-COCO and Flickr30k.

Image/text pairs – JSON schema of phrases in the dataset, associated with images (in this example) from Microsoft's COCO image database.

Text-to-image replacement for the system was powered by the pre-trained Visual Semantic Reasoning Network (VSRN), developed in 2019 at Northeastern University in Boston. VSRN was set to operate on manually pre-selected phrases from the contributing text datasets.
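
In broad terms, a retrieval network of this kind embeds text and images into a shared semantic space and swaps a selected phrase for the closest candidate image. The sketch below illustrates the idea only; the random vectors and the embed_text placeholder are assumptions standing in for real VSRN embeddings, not the actual model.

```python
# Illustrative sketch of retrieval-based text-to-image substitution: the selected
# phrase and the candidate images are embedded into a shared space, and the image
# whose embedding lies closest to the phrase embedding is substituted in.
# The random vectors below are placeholders for real VSRN text/image embeddings.

import numpy as np

rng = np.random.default_rng(0)

def embed_text(phrase: str) -> np.ndarray:
    """Placeholder for a joint-space text encoder (e.g. the text branch of VSRN)."""
    return rng.standard_normal(256)

# Placeholder image bank: identifier -> pre-computed joint-space embedding.
image_bank = {f"coco_{i:06d}.jpg": rng.standard_normal(256) for i in range(1000)}

def substitute_phrase_with_image(phrase: str) -> str:
    """Return the identifier of the most semantically similar image."""
    query = embed_text(phrase)
    query /= np.linalg.norm(query)
    best_id, best_score = None, -1.0
    for image_id, vec in image_bank.items():
        score = float(vec @ query / np.linalg.norm(vec))
        if score > best_score:
            best_id, best_score = image_id, score
    return best_id

print(substitute_phrase_with_image("a dog chasing a ball in the park"))
```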

Establishing Coherence

Coherence of the source datasets was established by creating six combinations of dialogue dataset and image dataset (the three text sources each paired with the two image sources), which were evaluated over several rounds by human assessors.

The human scoring was based on three criteria: consistency with the context of the exchange; the relevance of the image to the core concept that the replaced text was trying to express; and the extent to which the image contained key objects from the target sentence.

Considering the latter criterion, it could be argued that the schema the researchers decided on largely discounts the possibility of humorous, sarcastic, abstract or metaphysical readings of an image injected into a text conversation.

However, this is seminal work, and it has to start somewhere; considerable effort is meanwhile being expended elsewhere in the Natural Language Processing (NLP) sector to map instances of sarcasm, among other less tangible aspects of the image/text relationship.
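
To make the evaluation set-up above concrete: three text datasets crossed with two image datasets yields six combinations, each scored by humans against the three criteria. The rough sketch below enumerates those pairings; the field names, dummy ratings and simple averaging are illustrative assumptions, not the paper's actual protocol.

```python
# Rough sketch of the evaluation set-up: three text datasets crossed with two
# image datasets give six source combinations, each of which receives human
# scores on the three criteria described above. The score values and the simple
# averaging below are illustrative assumptions, not figures from the paper.

from itertools import product
from statistics import mean

TEXT_DATASETS = ["DailyDialog", "EmpatheticDialogues", "PersonaChat"]
IMAGE_DATASETS = ["MS-COCO", "Flickr30k"]
CRITERIA = ["context_consistency", "image_relevance", "key_object_coverage"]

def aggregate(ratings: dict[str, list[int]]) -> float:
    """Average the human ratings across the three criteria."""
    return mean(mean(ratings[c]) for c in CRITERIA)

for text_set, image_set in product(TEXT_DATASETS, IMAGE_DATASETS):
    # Dummy ratings standing in for several rounds of human judgement.
    dummy_ratings = {c: [3, 4, 4] for c in CRITERIA}
    print(f"{text_set} + {image_set}: mean score {aggregate(dummy_ratings):.2f}")
```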

Testing

To test the data generation framework, the researchers used a three-part retrieval model based on Facebook's 2020 Image-Chat research. The module comprises ResNeXt-101 as the image encoder, Google's BERT as the text encoder, and a custom module that fuses the two.
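
A dual-encoder-with-fusion arrangement of this kind can be sketched as below. The small linear layers are stand-ins for ResNeXt-101 and BERT feature extractors, and the dot-product scoring of candidate responses is an assumption about how ranking might be done, not the paper's exact fusion module.

```python
# Minimal sketch of a dual-encoder retrieval model with a fusion step: one
# encoder for the image, one for the dialogue text, and a fusion module that
# scores candidate responses. The small linear layers below are stand-ins for
# ResNeXt-101 and BERT, and the scoring scheme is an illustrative assumption.

import torch
import torch.nn as nn

class MultimodalRetriever(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, joint_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, joint_dim)   # stand-in for ResNeXt-101 features
        self.text_proj = nn.Linear(text_dim, joint_dim)     # stand-in for BERT features
        self.fusion = nn.Sequential(                        # fuses image and dialogue context
            nn.Linear(joint_dim * 2, joint_dim), nn.ReLU())

    def forward(self, image_feat, context_feat, candidate_feats):
        fused = self.fusion(torch.cat([self.image_proj(image_feat),
                                       self.text_proj(context_feat)], dim=-1))
        candidates = self.text_proj(candidate_feats)          # encode each candidate response
        return candidates @ fused                             # higher score = better next turn

model = MultimodalRetriever()
scores = model(torch.randn(2048), torch.randn(768), torch.randn(10, 768))
print(scores.argmax().item())  # index of the best-ranked candidate response
```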

The system achieved scores of 50.35 and 14.38 on the current-sentence and next-sentence prediction tasks respectively, improving on the baseline for each task.
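
Retrieval-style models of this kind are typically evaluated with a recall-at-k measure over a pool of candidate sentences; the sketch below shows recall@1 as one common way such figures are computed, offered as an assumption rather than as the paper's exact protocol.

```python
# Sketch of the recall@1 metric commonly used to score retrieval-style sentence
# prediction: for each test exchange, the model ranks a pool of candidate
# sentences, and the prediction counts as correct only if the ground-truth
# sentence is ranked first. This is a typical evaluation scheme for models of
# this kind, not necessarily the paper's exact protocol.

def recall_at_1(ranked_candidates: list[list[str]], ground_truth: list[str]) -> float:
    """Percentage of examples where the top-ranked candidate is the true sentence."""
    hits = sum(1 for ranking, truth in zip(ranked_candidates, ground_truth)
               if ranking and ranking[0] == truth)
    return 100.0 * hits / len(ground_truth)

rankings = [["see you at the park", "no thanks"], ["no thanks", "sounds great"]]
truths = ["see you at the park", "sounds great"]
print(recall_at_1(rankings, truths))  # 50.0
```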

Later, two researchers were tasked with manually creating 100 multimodal dialogues by inserting images into conversations, and the system was run against these 'organic' multimodal exchanges. It was able to predict current-turn and next-turn responses with high awareness of context, even for these ad hoc examples.

Results of the testing for the Korean multimodal dataset generation system, revealing consistently high correlation between text-to-image similarity and human-based question scores on the same data.