
Published: 15 July 2024
Contributors: Cole Stryker

What is multimodal AI?

Multimodal AI refers to machine learning models capable of processing and integrating information from multiple modalities or types of data. These modalities can include text, images, audio, video and other forms of sensory input.

Unlike traditional AI models that are typically designed to handle a single type of data, multimodal AI combines and analyzes different forms of data inputs to achieve a more comprehensive understanding and generate more robust outputs. As an example, a multimodal model can receive a photo of a landscape as an input and generate a written summary of that place’s characteristics. Or, it could receive a written summary of a landscape and generate an image based on that description. This ability to work across multiple modalities gives these models powerful capabilities.

OpenAI launched ChatGPT in November 2022, which quickly put generative AI on the map. ChatGPT was a unimodal AI, designed to receive text inputs and generate text outputs by using natural language processing (NLP). Multimodal AI makes gen AI more robust and useful by allowing multiple types of inputs and outputs. DALL-E, for example, was OpenAI's initial multimodal implementation of its GPT model, but GPT-4o introduced multimodal capabilities to ChatGPT as well.

Multimodal AI models can combine information from various data sources and across media to provide a more comprehensive and nuanced understanding of the data. This allows the AI to make better-informed decisions and generate more accurate outputs. By leveraging different modalities, multimodal AI systems can achieve higher accuracy and robustness in tasks such as image recognition, language translation and speech recognition. Integrating different types of data helps capture more context and reduce ambiguity. Multimodal AI systems are also more resilient to noise and missing data: if one modality is unreliable or unavailable, the system can rely on the others to maintain performance.

Multimodal AI enhances human-computer interaction by enabling more natural and intuitive interfaces for better user experiences. For instance, virtual assistants can understand and respond to both voice commands and visual cues, making interactions smoother and more efficient. Imagine a chatbot that can talk to you about your glasses and make sizing recommendations based on a photo that you share with it, or a bird identification app that can recognize images of a particular bird, and confirm its identification by “listening” to an audio clip of its song. AI that can operate across multiple sensory dimensions can give users more meaningful outputs, and more ways to engage with data.

How multimodal AI works

Artificial intelligence is a rapidly evolving field in which the latest advances in training algorithms to build foundation models are being applied to multimodal research. Earlier multimodal innovations, such as audio-visual speech recognition and multimedia content indexing, were developed before advances in deep learning and data science paved the way for gen AI.

Today, practitioners use multimodal AI in all kinds of use cases, from analyzing medical images in healthcare to using computer vision alongside other sensory inputs in AI-powered autonomous vehicles.

A 2022 paper out of Carnegie Mellon1 describes three characteristics of multimodal AI: heterogeneity, connections and interactions. Heterogeneity refers to the diverse qualities, structures and representations of modalities. A text description of an event will be fundamentally different in quality, structure and representation from a photograph of the same event. Connections refers to the complementary information shared between different modalities. These connections may be reflected in statistical similarities or in semantic correspondence. Lastly, interactions refers to how different modalities interact when they are brought together.

The core engineering challenge for multimodal AI lies in effectively integrating and processing diverse types of data to create models that can leverage the strengths of each modality while overcoming their individual limitations. The paper’s authors also put forth six core challenges: representation, alignment, reasoning, generation, transference and quantification.

  • Representation refers to how multimodal data is represented and summarized in a way that reflects the heterogeneity of modalities and the interconnections between them. Practitioners use specialized neural networks (for example, CNNs for images, transformers for text) to extract features, and employ joint embedding spaces or attention mechanisms for representation learning (see the code sketch after this list).

  • Alignment aims to identify connections and interactions across elements of different modalities. For example, engineers use techniques such as temporal alignment for video and audio data and spatial alignment for images and text.

  • Reasoning aims to compose knowledge from multimodal evidence, usually through multiple inferential steps.

  • Generation involves learning a generative process to produce raw modalities that reflect cross-modal interactions, structure and coherence.

  • Transference aims to transfer knowledge between modalities. Advanced transfer learning techniques and shared embedding spaces allow knowledge to be transferred across modalities.

  • Quantification involves empirical and theoretical studies of multimodal learning to better understand and evaluate the performance of multimodal models.
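To make the representation challenge more concrete, here is a minimal, illustrative sketch (not a production recipe) of a joint embedding space: a small CNN encodes images, a simple token-embedding encoder handles text, and both are projected into a shared space where related image-caption pairs can be compared by similarity. All layer sizes, the vocabulary size and the dummy inputs are hypothetical choices made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Tiny CNN that maps an RGB image to a feature vector."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, images):                    # images: (batch, 3, H, W)
        features = self.conv(images).flatten(1)   # (batch, 64)
        return self.proj(features)                # (batch, embed_dim)

class TextEncoder(nn.Module):
    """Token embeddings mean-pooled into a single caption vector."""
    def __init__(self, vocab_size=10000, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, token_ids):                 # token_ids: (batch, seq_len)
        pooled = self.embed(token_ids).mean(dim=1)
        return self.proj(pooled)                  # (batch, embed_dim)

# Project both modalities into the same space and compare them.
image_encoder, text_encoder = ImageEncoder(), TextEncoder()
images = torch.randn(4, 3, 64, 64)            # a batch of 4 dummy images
captions = torch.randint(0, 10000, (4, 12))   # 4 dummy captions of 12 tokens

image_vecs = F.normalize(image_encoder(images), dim=-1)
text_vecs = F.normalize(text_encoder(captions), dim=-1)
similarity = image_vecs @ text_vecs.T          # (4, 4) image-caption similarity matrix
print(similarity.shape)
```

In practice, encoders like these are trained, often with a contrastive objective, so that matching image-text pairs land close together in the shared space.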

Multimodal models add a layer of complexity to large language models (LLMs), which are based on transformers, themselves built on an encoder-decoder architecture with an attention mechanism to efficiently process data. Multimodal AI uses data fusion techniques to integrate different modalities. This fusion can be described as early (when modalities are encoded into the model to create a common representation space), mid (when modalities are combined at different preprocessing stages) or late (when multiple models process different modalities and combine the outputs).
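As a rough illustration of these fusion strategies, the sketch below contrasts early fusion (concatenating modality features into a single representation before one shared model processes them) with late fusion (letting separate per-modality models produce outputs that are then combined). The two modalities, feature dimensions and three-class output are assumptions made purely for the example.

```python
import torch
import torch.nn as nn

# Assume two modalities have already been encoded into fixed-size feature vectors,
# for example by encoders like those sketched earlier.
image_feat = torch.randn(4, 256)   # (batch, image feature dim)
audio_feat = torch.randn(4, 128)   # (batch, audio feature dim)

# Early fusion: concatenate features into a common representation,
# then let a single classifier reason over the combined vector.
early_head = nn.Sequential(nn.Linear(256 + 128, 64), nn.ReLU(), nn.Linear(64, 3))
early_logits = early_head(torch.cat([image_feat, audio_feat], dim=-1))

# Late fusion: each modality gets its own model; only the outputs are combined.
image_head = nn.Linear(256, 3)
audio_head = nn.Linear(128, 3)
late_logits = (image_head(image_feat) + audio_head(audio_feat)) / 2

print(early_logits.shape, late_logits.shape)   # both (4, 3)
```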

Trends in multimodal AI

Multimodal AI is a rapidly evolving field, with several key trends shaping its development and application. Here are some of the notable trends:

Unified models

OpenAI’s GPT-4V (GPT-4 with vision), Google’s Gemini and other unified models are designed to handle text, images and other data types within a single architecture. These models can understand and generate multimodal content seamlessly.

Enhanced cross-modal interaction

Advanced attention mechanisms and transformers are being used to better align and fuse data from different formats, leading to more coherent and contextually accurate outputs.
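As a minimal, illustrative example of this kind of cross-modal interaction, the sketch below uses a standard multi-head attention layer so that text tokens can attend to image patch features. The batch size, sequence lengths and embedding dimension are placeholder values, not settings from any particular model.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
cross_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Placeholder features: 20 text tokens query 49 image patch embeddings (a 7x7 grid).
text_tokens = torch.randn(2, 20, embed_dim)     # (batch, text length, dim)
image_patches = torch.randn(2, 49, embed_dim)   # (batch, number of patches, dim)

# Each text token gathers information from the image patches most relevant to it.
fused, attention_weights = cross_attention(
    query=text_tokens, key=image_patches, value=image_patches
)
print(fused.shape)              # (2, 20, 256): text tokens enriched with visual context
print(attention_weights.shape)  # (2, 20, 49): how strongly each token attends to each patch
```

Stacking layers like this, interleaved with self-attention, is one common way unified models fuse modalities within a single architecture.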

Real-time multimodal processing

Applications in autonomous driving and augmented reality, for example, require AI to process and integrate data from various sensors (cameras, LIDAR and more) in real time to make instantaneous decisions.

Multimodal data augmentation

Researchers are generating synthetic data that combines various modalities (for example, text descriptions with corresponding images) to augment training datasets and improve model performance.

Open source and collaboration

Initiatives like Hugging Face and Google AI are providing open-source AI tools, fostering a collaborative environment for researchers and developers to advance the field.

 

Footnotes

1. https://arxiv.org/abs/2209.03430 (link resides outside ibm.com), 7 September 2022.