Published: 15 July 2024
Contributors: Cole Stryker
Multimodal AI refers to machine learning models capable of processing and integrating information from multiple modalities or types of data. These modalities can include text, images, audio, video and other forms of sensory input.
Unlike traditional AI models that are typically designed to handle a single type of data, multimodal AI combines and analyzes different forms of data inputs to achieve a more comprehensive understanding and generate more robust outputs.
As an example, a multimodal model can receive a photo of a landscape as an input and generate a written summary of that place’s characteristics. Or, it could receive a written summary of a landscape and generate an image based on that description. This ability to work across multiple modalities gives these models powerful capabilities.
OpenAI launched ChatGPT in November 2022, quickly putting generative AI on the map. ChatGPT began as a unimodal AI, designed to receive text inputs and generate text outputs by using natural language processing (NLP).
Multimodal AI makes gen AI more robust and useful by allowing multiple types of inputs and outputs. DALL-E, for example, was OpenAI’s initial multimodal implementation of its GPT model, and GPT-4o later introduced multimodal capabilities to ChatGPT as well.
Multimodal AI models can combine information from various data sources and across media to provide a more comprehensive and nuanced understanding of the data. This allows the AI to make better-informed decisions and generate more accurate outputs.
By leveraging different modalities, multimodal AI systems can achieve higher accuracy and robustness in tasks such as image recognition, language translation and speech recognition. The integration of different types of data helps in capturing more context and reducing ambiguities. Multimodal AI systems are more resilient to noise and missing data. If one modality is unreliable or unavailable, the system can rely on other modalities to maintain performance.
Multimodal AI enhances human-computer interaction by enabling more natural and intuitive interfaces for better user experiences. For instance, virtual assistants can understand and respond to both voice commands and visual cues, making interactions smoother and more efficient.
Imagine a chatbot that can talk to you about your glasses and make sizing recommendations based on a photo that you share with it, or a bird identification app that can recognize images of a particular bird, and confirm its identification by “listening” to an audio clip of its song. AI that can operate across multiple sensory dimensions can give users more meaningful outputs, and more ways to engage with data.
Artificial intelligence is a rapidly evolving field in which the latest advances in training algorithms to build foundation models are being applied to multimodal research. The discipline saw earlier multimodal innovations, such as audio-visual speech recognition and multimedia content indexing, that were developed before advances in deep learning and data science paved the way for gen AI.
Today, practitioners use multimodal AI in all kinds of use cases, from analyzing medical images in healthcare to using computer vision alongside other sensory inputs in AI-powered autonomous vehicles.
A 2022 paper out of Carnegie Mellon describes three characteristics of multimodal AI: heterogeneity, connections and interactions.1 Heterogeneity refers to the diverse qualities, structures and representations of modalities. A text description of an event will be fundamentally different in quality, structure and representation from a photograph of the same event.
Connections refers to the complementary information shared between different modalities. These connections may be reflected in statistical similarities or in semantic correspondence. Lastly, interactions refers to how different modalities interact when they are brought together.
The core engineering challenge for multimodal AI lies in effectively integrating and processing diverse types of data to create models that can leverage the strengths of each modality while overcoming their individual limitations. The paper’s authors also put forth six challenges: representation, alignment, reasoning, generation, transference and quantification.
Representation refers to how to represent and summarize multimodal data to reflect the heterogeneity and interconnections between modalities. Practitioners use specialized neural networks (for example, CNNs for images, transformers for text) to extract features and employ joint embedding spaces or attention mechanisms for representation learning (see the sketch that follows these challenge descriptions).
Alignment aims to identify connections and interactions across elements of different modalities. For example, engineers use temporal alignment techniques for video and audio data and spatial alignment for images and text.
Reasoning aims to compose knowledge from multimodal evidence, usually through multiple inferential steps.
Generation involves learning a generative process to produce raw modalities that reflect cross-modal interactions, structure and coherence.
Transference aims to transfer knowledge between modalities. Advanced transfer learning techniques and shared embedding spaces allow knowledge to be transferred across modalities.
Quantification involves empirical and theoretical studies of multimodal learning to better understand and evaluate the performance of multimodal models.
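To make the representation and transference challenges more concrete, here is a minimal sketch, assuming PyTorch, of two toy encoders that project an image and a text sequence into a shared embedding space where their similarity can be scored. The encoder designs, dimensions and random inputs are illustrative assumptions, not a description of any specific production model.

```python
# A minimal joint-embedding sketch (assumes PyTorch). The toy encoders,
# dimensions and random data are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Toy CNN that maps a 3 x 64 x 64 image to a feature vector."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class TextEncoder(nn.Module):
    """Toy transformer that maps a token sequence to a feature vector."""
    def __init__(self, vocab_size=10_000, out_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(128, out_dim)

    def forward(self, tokens):
        return self.fc(self.encoder(self.embed(tokens)).mean(dim=1))

# Project both modalities into the same space and score their similarity.
image_enc, text_enc = ImageEncoder(), TextEncoder()
images = torch.randn(4, 3, 64, 64)          # batch of 4 random "images"
tokens = torch.randint(0, 10_000, (4, 16))  # batch of 4 random "captions"

img_emb = F.normalize(image_enc(images), dim=-1)
txt_emb = F.normalize(text_enc(tokens), dim=-1)
similarity = img_emb @ txt_emb.T            # 4 x 4 cosine-similarity matrix
print(similarity.shape)                     # torch.Size([4, 4])
```

In practice, a shared space like this is typically trained with a contrastive objective so that matching image-text pairs score higher than mismatched ones, which is also what makes transferring knowledge between modalities possible.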
Multimodal models add a layer of complexity to large language models (LLMs), which are based on transformers, themselves built on an encoder-decoder architecture with an attention mechanism to efficiently process data. Multimodal AI uses data fusion techniques to integrate different modalities. This fusion can be described as early (when modalities are encoded into the model to create a common representation space), mid (when modalities are combined at different preprocessing stages) and late (when multiple models process different modalities and combine the outputs).
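As a minimal illustration of these fusion stages, the sketch below, which assumes NumPy and uses arbitrary feature sizes and random weights, contrasts early fusion (concatenating features before a single model) with late fusion (combining per-modality outputs).

```python
# Early vs. late fusion sketch (assumes NumPy). Feature sizes, random
# weights and score averaging are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
image_features = rng.normal(size=(1, 512))  # e.g., output of an image encoder
text_features = rng.normal(size=(1, 256))   # e.g., output of a text encoder

# Early fusion: concatenate modality features into one representation
# that a single downstream model would consume.
early = np.concatenate([image_features, text_features], axis=1)  # shape (1, 768)

# Late fusion: run a separate head per modality, then combine the outputs,
# here by averaging the scores of two hypothetical 3-class classifiers.
W_img = rng.normal(size=(512, 3))
W_txt = rng.normal(size=(256, 3))
late = (image_features @ W_img + text_features @ W_txt) / 2      # shape (1, 3)

print(early.shape, late.shape)
```

Mid fusion sits between these two extremes, combining intermediate representations at one or more preprocessing or encoding stages rather than raw features or final outputs.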
Multimodal AI is a rapidly evolving field, with several key trends shaping its development and application. Here are some of the notable trends:
OpenAI’s GPT-4V(ision), Google’s Gemini and other unified models are designed to handle text, images and other data types within a single architecture. These models can understand and generate multimodal content seamlessly.
Advanced attention mechanisms and transformers are being used to better align and fuse data from different formats, leading to more coherent and contextually accurate outputs; a brief cross-attention sketch after this list of trends illustrates the idea.
Applications in autonomous driving and augmented reality, for example, require AI to process and integrate data from various sensors (cameras, LIDAR and more) in real time to make instantaneous decisions.
Researchers are generating synthetic data that combines various modalities (for example, text descriptions with corresponding images) to augment training datasets and improve model performance.
Initiatives like Hugging Face and Google AI are providing open-source AI tools, fostering a collaborative environment for researchers and developers to advance the field.
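To illustrate the attention-based alignment and fusion trend noted above, here is a brief sketch, assuming PyTorch and arbitrary shapes, in which text tokens attend over image-patch features through cross-attention.

```python
# Cross-modal attention sketch (assumes PyTorch); all shapes are illustrative.
import torch
import torch.nn as nn

d_model = 128
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

text_tokens = torch.randn(2, 16, d_model)    # batch of 2 sequences of 16 text tokens
image_patches = torch.randn(2, 49, d_model)  # batch of 2 grids of 7 x 7 patch features

# Each text token queries the image patches, mixing visual context into the
# text representation; the weights show which patches each token attends to.
fused, attn_weights = cross_attn(query=text_tokens, key=image_patches, value=image_patches)
print(fused.shape, attn_weights.shape)       # (2, 16, 128) and (2, 16, 49)
```

The attention weights indicate which image regions each text token draws on, which is one way models align language with visual context before fusing the two.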
Our research includes advanced computer vision techniques that enable the automatic extraction of diagnostically relevant features in multimodal healthcare images.
IBM and NASA are currently experimenting with model architectures and techniques to integrate these varying time and spatial scales into one multimodal model.
This survey navigates the current landscape of multimodal ML, focusing on its profound impact on medical image analysis and clinical decision support systems.
Explore our centralized hub for AI research, from basic principles to emerging research to salient issues and advancements.
Learn how IBM is developing generative foundation models that are trustworthy, energy efficient and portable.
A beginner course: in two hours, learn the basics of AI and build and test your first machine learning model using Python and scikit-learn.
1 https://arxiv.org/abs/2209.03430 (link resides outside ibm.com), 7 September 2022.