What is multimodal AI?

15 July 2024

Author

Cole Stryker

Editorial Lead, AI Models

Multimodal AI refers to machine learning models capable of processing and integrating information from multiple modalities or types of data. These modalities can include text, images, audio, video and other forms of sensory input.

Unlike traditional AI models that are typically designed to handle a single type of data, multimodal AI combines and analyzes different forms of data inputs to achieve a more comprehensive understanding and generate more robust outputs.

As an example, a multimodal model can receive a photo of a landscape as an input and generate a written summary of that place’s characteristics. Or, it could receive a written summary of a landscape and generate an image based on that description. This ability to work across multiple modalities gives these models powerful capabilities.

OpenAI's launch of ChatGPT in November 2022 quickly put generative AI on the map. ChatGPT was a unimodal AI, designed to receive text inputs and generate text outputs by using natural language processing (NLP).

Multimodal AI makes gen AI more robust and useful by allowing multiple types of inputs and outputs. DALL-E, for example, was OpenAI's initial multimodal implementation of its GPT models, and GPT-4o later introduced multimodal capabilities to ChatGPT as well.

Multimodal AI models can combine information from various data sources and across media to provide a more comprehensive and nuanced understanding of the data. This allows the AI to make better-informed decisions and generate more accurate outputs.

By leveraging different modalities, multimodal AI systems can achieve higher accuracy and robustness in tasks such as image recognition, language translation and speech recognition. The integration of different types of data helps in capturing more context and reducing ambiguities. Multimodal AI systems are more resilient to noise and missing data. If one modality is unreliable or unavailable, the system can rely on other modalities to maintain performance.

Multimodal AI enhances human-computer interaction by enabling more natural and intuitive interfaces for better user experiences. For instance, virtual assistants can understand and respond to both voice commands and visual cues, making interactions smoother and more efficient.

Imagine a chatbot that can talk to you about your glasses and make sizing recommendations based on a photo that you share with it, or a bird identification app that can recognize images of a particular bird, and confirm its identification by “listening” to an audio clip of its song. AI that can operate across multiple sensory dimensions can give users more meaningful outputs, and more ways to engage with data.

How multimodal AI works

Artificial intelligence is a rapidly evolving field in which the latest advances in training algorithms for building foundation models are now being applied to multimodal research. Earlier multimodal innovations, such as audio-visual speech recognition and multimedia content indexing, were developed before advances in deep learning and data science paved the way for gen AI.

Today, practitioners use multimodal AI in all kinds of use cases, from analyzing medical images in healthcare to using computer vision alongside other sensory inputs in AI-powered autonomous vehicles.

A 2022 paper out of Carnegie Mellon describes three characteristics of multimodal AI: heterogeneity, connections and interactions.1 Heterogeneity refers to the diverse qualities, structures and representations of modalities. A text description of an event will be fundamentally different in quality, structure and representation from a photograph of the same event.

Connections refers to the complementary information shared between different modalities. These connections may be reflected in statistical similarities or in semantic correspondence. Lastly, interactions refers to how different modalities interact when they are brought together.

The core engineering challenge for multimodal AI lies in effectively integrating and processing diverse types of data to create models that can leverage the strengths of each modality while overcoming their individual limitations. The paper’s authors also put forth several challenges: representation, alignment, reasoning, generation, transference and quantification.

  • Representation refers to how to represent and summarize multimodal data in a way that reflects both the heterogeneity of modalities and the interconnections between them. Practitioners use specialized neural networks (for example, CNNs for images and transformers for text) to extract features, and employ joint embedding spaces or attention mechanisms for representation learning; see the sketch after this list.

  • Alignment aims to identify connections and interactions across elements of different modalities. For example, engineers use temporal alignment techniques for video and audio data and spatial alignment techniques for images and text.

  • Reasoning aims to compose knowledge from multimodal evidence, usually through multiple inferential steps.

  • Generation involves learning a generative process to produce raw modalities that reflect cross-modal interactions, structure and coherence.

  • Transference aims to transfer knowledge between modalities. Advanced transfer learning techniques and shared embedding spaces allow knowledge to be transferred across modalities.

  • Quantification involves empirical and theoretical studies to better understand multimodal learning and evaluate the performance of multimodal models.

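As a rough illustration of the representation and alignment ideas above, the sketch below projects image features and text features into a shared embedding space and scores image-text pairs by similarity. It is a minimal, hypothetical example in PyTorch; the feature sizes, the projection dimension and the random placeholder inputs are assumptions for illustration, not the architecture of any specific model.

```python
import torch
import torch.nn as nn

class JointEmbeddingModel(nn.Module):
    """Toy two-tower model: each modality keeps its own encoder,
    and both are projected into one shared embedding space."""

    def __init__(self, image_dim=2048, text_dim=768, shared_dim=256):
        super().__init__()
        # Placeholder projections: in practice, image_features would come
        # from a CNN or vision transformer and text_features from a
        # language transformer.
        self.image_proj = nn.Linear(image_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, image_features, text_features):
        # Project each modality into the shared space and L2-normalize,
        # so the dot product acts as a cosine similarity.
        img = nn.functional.normalize(self.image_proj(image_features), dim=-1)
        txt = nn.functional.normalize(self.text_proj(text_features), dim=-1)
        # Similarity matrix: entry (i, j) scores image i against text j.
        return img @ txt.T

model = JointEmbeddingModel()
image_features = torch.randn(4, 2048)  # stand-in for image encoder outputs
text_features = torch.randn(4, 768)    # stand-in for text encoder outputs
similarity = model(image_features, text_features)
print(similarity.shape)  # torch.Size([4, 4])
```

Training such a model contrastively, pulling matched image-text pairs together in the shared space while pushing mismatched pairs apart, is one common way to learn the cross-modal connections described above.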

Multimodal models add a layer of complexity to large language models (LLMs), which are based on transformers, themselves built on an encoder-decoder architecture with an attention mechanism to process data efficiently. Multimodal AI uses data fusion techniques to integrate different modalities. This fusion can be described as early (modalities are encoded together into a common representation space), mid (modalities are combined at intermediate processing stages) or late (separate models process each modality and their outputs are combined).

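To make the fusion distinction concrete, the sketch below contrasts early fusion (concatenating modality features before one shared layer) with late fusion (separate heads per modality whose outputs are averaged). It is a hypothetical PyTorch example; the feature sizes and the ten-class output are placeholder assumptions, not any particular production architecture.

```python
import torch
import torch.nn as nn

image_feat = torch.randn(1, 512)  # placeholder features from an image encoder
audio_feat = torch.randn(1, 128)  # placeholder features from an audio encoder

# Early fusion: concatenate modality features, then process them jointly.
early_head = nn.Linear(512 + 128, 10)
early_logits = early_head(torch.cat([image_feat, audio_feat], dim=-1))

# Late fusion: run a separate head per modality, then combine the outputs.
image_head = nn.Linear(512, 10)
audio_head = nn.Linear(128, 10)
late_logits = (image_head(image_feat) + audio_head(audio_feat)) / 2

print(early_logits.shape, late_logits.shape)  # both torch.Size([1, 10])
```

Mid fusion sits between these extremes, exchanging information at intermediate layers rather than only at the input or the output.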

Trends in multimodal AI

Multimodal AI is a rapidly evolving field, with several key trends shaping its development and application. Here are some of the notable trends:

Unified models

OpenAI’s GPT-4V(ision), Google’s Gemini and other unified models are designed to handle text, images and other data types within a single architecture. These models can understand and generate multimodal content seamlessly.

Enhanced cross-modal interaction

Advanced attention mechanisms and transformers are being used to better align and fuse data from different formats, leading to more coherent and contextually accurate outputs.

Real-time multimodal processing

Applications in autonomous driving and augmented reality, for example, require AI to process and integrate data from various sensors (cameras, LiDAR and more) in real time to make instantaneous decisions.

Multimodal data augmentation

Researchers are generating synthetic data that combines various modalities (for example, text descriptions paired with corresponding images) to augment training datasets and improve model performance.

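As a deliberately simplified illustration of this trend, the sketch below pairs unlabeled images with machine-generated captions to create synthetic image-text training examples. The caption_image() helper is hypothetical, standing in for whatever captioning model a team might use; it is not a specific library API.

```python
from pathlib import Path

def caption_image(image_path):
    """Hypothetical stand-in for an off-the-shelf image-captioning model."""
    return f"A generated description of {Path(image_path).name}"

def build_synthetic_pairs(image_dir):
    """Pair each unlabeled image with a generated caption to create
    synthetic (image, text) examples for multimodal training."""
    return [
        {"image": str(path), "text": caption_image(path)}
        for path in sorted(Path(image_dir).glob("*.jpg"))
    ]

synthetic_dataset = build_synthetic_pairs("unlabeled_images/")
print(f"Created {len(synthetic_dataset)} synthetic image-text pairs")
```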

Open source and collaboration

Initiatives such as Hugging Face and Google AI provide open-source models and tools, fostering a collaborative environment for researchers and developers to advance the field.

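As a small example of what these open-source tools make possible, the Hugging Face transformers library exposes multimodal tasks such as image captioning through its pipeline interface. The snippet below is a sketch that assumes the transformers package is installed and uses a publicly available captioning checkpoint; the exact model name and output format can vary between library versions.

```python
from transformers import pipeline

# Image-to-text pipeline: takes an image (file path, URL or PIL image)
# and returns a generated caption.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

result = captioner("landscape_photo.jpg")
print(result[0]["generated_text"])
```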

Footnotes

1. Paul Pu Liang, Amir Zadeh and Louis-Philippe Morency, "Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions," arXiv:2209.03430, 7 September 2022, https://arxiv.org/abs/2209.03430.