The key concept lies in the fact that “data sets are fundamental building blocks of AI systems,” and that without data sets, “models can’t learn the relationships that inform their predictions.” The ABI report predicts that “while the total installed base of AI devices will grow from 2.69 billion in 2019 to 4.47 billion in 2024, comparatively few will be interoperable in the short term.”
This could represent a considerable waste of time, energy and resources, “rather than combine the gigabytes to petabytes of data flowing through them into a single AI model or framework, they’ll work independently and heterogeneously to make sense of the data they’re fed.”
To overcome this, ABI proposes multimodal learning, a methodology that could consolidate data “from various sensors and inputs into a single system. Multimodal learning can carry complementary information or trends, which often only become evident when they’re all included in the learning process.”
VB presents a viable example that considers images and text captions. “ If different words are paired with similar images, these words are likely used to describe the same things or objects. Conversely, if some words appear next to different images, this implies these images represent the same object. Given this, it should be possible for an AI model to predict image objects from text descriptions, and indeed, a body of academic literature has proven this to be the case.”
Despite the possible advantages, ABI notes that even tech giants like IBM, Microsoft, Amazon, and Google continue to focus predominantly on unimodal systems. One of the reasons being the challenges such a switch would represent.
Still, the ABI researchers anticipate that “the total number of devices shipped will grow from 3.94 million in 2017 to 514.12 million in 2023, spurred by adoption in the robotics, consumer, health care, and media and entertainment segments.” Among the examples of companies that are already implementing multimodal learning they cite Waymo which is using such approaches to build “ hyper-aware self-driving vehicles,” and Intel Labs, where the company’s engineering team is “investigating techniques for sensor data collation in real-world environments.”
Intel Labs principal engineer Omesh Tickoo explained to VB that “What we did is, using techniques to figure out context such as the time of day, we built a system that tells you when a sensor’s data is not of the highest quality. Given that confidence value, it weighs different sensors against each at different intervals and chooses the right mix to give us the answer we’re looking for.”
VB notes that unimodal learning will remain predominant where it is highly effective – in applications like image recognition and natural language processing. At the same time it predicts that “as electronics become cheaper and compute more scalable, multimodal learning will likely only rise in prominence.”