GPT-4o is a multimodal and multilingual generative pretrained transformer model released in May 2024 by artificial intelligence (AI) developer OpenAI. It is the flagship large language model (LLM) in the GPT-4 family of AI models, which also includes GPT-4o mini, GPT-4 Turbo and the original GPT-4.
The “o” in GPT-4o stands for omni and highlights that GPT-4o is a multimodal AI model with sound and vision capabilities. It can accept prompts that mix text, audio, image and video input, and it is also capable of image generation. GPT-4o brings multimedia input and output capabilities to the same transformer-powered GPT-4 intelligence that fuels the other models in its line.
Revealed in May 2024 as part of OpenAI’s Spring Updates, ChatGPT’s new model was demonstrated translating spoken language in real time, adapting audio responses to include emotional content and engaging in lifelike conversations.
Both GPT-4o and GPT-4o mini support fine-tuning, enabling developers to apply these models toward specific use cases.
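As an illustration, a fine-tuning job might be started through OpenAI’s Python SDK roughly as in the sketch below; the training file name and the dated model snapshot are placeholders, and the exact identifiers depend on which snapshots OpenAI currently exposes for fine-tuning.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Upload a JSONL file of chat-formatted training examples (placeholder file name)
training_file = client.files.create(
    file=open("training_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job against a GPT-4o mini snapshot (illustrative identifier)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```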
GPT-4o mini is a smaller, cost-effective GPT-4o model and the fastest generative AI model in the OpenAI product family. Unlike larger LLMs with tens or hundreds of billions of parameters, GPT-4o mini is a small, lean model. Despite its compact size, it outperforms GPT-3.5 Turbo at similar speed and approximately 60% of the cost.
Like its larger sibling, GPT-4o mini has multimodal capabilities, supports languages other than English and can handle typical AI tasks such as reasoning, math and coding. As of publishing, GPT-4o mini can process text and image inputs. OpenAI claims that support for audio and video content is pending.
Developers can access GPT-4o mini through application programming interfaces (APIs) at a cost of USD 0.15 per million input tokens and USD 0.60 per million output tokens.
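A minimal sketch of calling GPT-4o mini through the official Python SDK and estimating the per-request cost from the token counts the API returns; the prices below simply mirror the figures above and may change.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain tokenization in one sentence."}],
)
print(response.choices[0].message.content)

# Estimate the cost of this single request from the usage the API reports
input_cost = response.usage.prompt_tokens * 0.15 / 1_000_000
output_cost = response.usage.completion_tokens * 0.60 / 1_000_000
print(f"Approximate cost: ${input_cost + output_cost:.6f}")
```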
GPT-4o is an “all-in-one” flagship model capable of processing multimodal inputs and outputs on its own as a single neural network. With previous models such as GPT-4 Turbo and GPT-3.5, users would need OpenAI APIs and other supporting models to input and generate varied content types. While GPT-4 Turbo can process image prompts, it is not capable of processing audio without API assistance.
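For example, a single chat completion request to GPT-4o can mix text and image content without any auxiliary model. This sketch assumes the OpenAI Python SDK and uses a placeholder image URL.

```python
from openai import OpenAI

client = OpenAI()

# One request, one model: text and an image in the same prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```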
The multimodal nature of GPT-4o is its single biggest breakthrough compared with GPT-4 Turbo and underpins many of its advancements:
Real-time audio conversations
Tone of voice capabilities
Built-in video processing
Image generation
Greater token efficiency
GPT-4o’s faster speed and multimodal capabilities allow it to engage conversationally and translate languages at a more humanlike pace than GPT-4 Turbo. In a video demo that was part of its release announcement1, GPT-4o was shown translating in real time between English and Spanish speakers. GPT-4o brings chatbot voice support in over 50 languages.
Because it can process audio inputs by itself, GPT-4o has lower latency—the time taken to produce output from the moment an input is received—than previous models. It responds to audio input in 320 milliseconds, comparable to the typical human response time of 210 milliseconds.
Previous iterations of GPT-4 required multiple models assembled in a pipeline to provide a similar service, which increased latency to 5.4 seconds. Before GPT-4o, OpenAI’s Whisper API converted audio prompts to text and fed them to GPT-4 Turbo, then text-to-speech (TTS) services converted GPT’s responses back to audio.
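In rough terms, that earlier pipeline looked something like the following sketch, which chains three separate API calls; the file names and the voice selection are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech to text: Whisper transcribes the user's audio prompt
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("question.mp3", "rb"),
)

# 2. Reasoning: only the text transcription is passed to GPT-4 Turbo
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text to speech: a TTS model converts the reply back to audio
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("answer.mp3", "wb") as f:
    f.write(speech.content)
```

Each hop in that chain adds latency, which is why collapsing the pipeline into a single model matters.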
GPT-4 Turbo was also a slower model in general. In benchmark tests by AI evaluation organization Artificial Analysis2, GPT-4o demonstrated a speed of 110 tokens per second, approximately three times faster than GPT-4 Turbo. In the same test, GPT-4o also outperformed many competitors, including Claude 3 Opus and Gemini 1.5 Pro.
The previous Whisper–GPT–TTS pipeline meant that when fielding audio inputs, GPT-4 Turbo was receiving only a text transcription of what was being said. The transcription isolates the content spoken by the user while filtering out background noise, speaker identities and tone of voice, depriving GPT of substantial contextual data.
As a multimodal model, GPT-4o can process the entirety of an audio input and respond appropriately to this additional context. Its improved natural language processing (NLP) enables it to incorporate sentiment, tone and emotional content into its output when used in voice mode.
GPT-4o handles image and video input in addition to text and audio. GPT-4o can connect to a live camera feed or record a user’s screen, then describe what it sees and answer questions. Users can turn on their smartphone cameras and speak with ChatGPT-4o the same way they would to a friend or colleague.
Previously, the GPT-4 Vision API allowed GPT-4 Turbo to analyze visual content.
According to OpenAI’s release statement, GPT-4o offers self-contained image generation abilities. GPT-4 Turbo was only able to generate images by connecting to OpenAI’s DALL-E 3 model. GPT-4 Turbo would field a text-based user prompt, then DALL-E would create the image.
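That older two-step flow resembled the following sketch, in which the image request is handed off to the DALL-E 3 endpoint rather than handled by the language model itself; the prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI()

# GPT-4 Turbo era: image generation required a separate call to DALL-E 3
image = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,
)
print(image.data[0].url)  # URL of the generated image
```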
OpenAI improved GPT-4’s tokenization abilities with GPT-4o. Tokenization is the process by which LLMs convert words into data. Each token represents a whole word, part of a word or a piece of punctuation. AI models convert words into tokens, then apply complex mathematics to analyze that data.
GPT-4o converts non-Roman languages, such as Chinese, Hindi and Arabic, into tokens much more efficiently than its predecessors. Because OpenAI charges API users per input or output token, GPT-4o’s improved efficiency in non-Roman scripts makes it more cost-effective for use cases in those languages.
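The difference can be seen with OpenAI’s open source tiktoken library: GPT-4o uses the newer o200k_base encoding, while GPT-4 and GPT-4 Turbo use cl100k_base. The Hindi sentence below is only an illustration; exact token counts vary by text.

```python
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # used by GPT-4 and GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # used by GPT-4o and GPT-4o mini

text = "नमस्ते, आप कैसे हैं?"  # Hindi: "Hello, how are you?"

print("GPT-4 Turbo tokens:", len(old_enc.encode(text)))
print("GPT-4o tokens:", len(new_enc.encode(text)))
```

Fewer tokens for the same text translates directly into lower API charges and more room in the context window.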
In addition to its new multimodal capabilities, GPT-4o brings many of the same functions as seen in prior iterations:
Question-answering
Document analysis and summarization
Sentiment analysis
Data analysis
Coding
With a knowledge cutoff of October 2023, GPT-4o is OpenAI’s most current model in terms of its knowledge base. A knowledge cutoff is the point in time at which a model’s training data concludes. Users can ask GPT-4o questions and receive answers, though at the risk of hallucinations.
Users can upload files and have ChatGPT analyze and summarize them. GPT-4o’s context window of 128,000 tokens allows it to process large input datasets, though that window is not quite as large as Claude 3’s.
The context window of an LLM represents the maximum number of tokens that it can field while maintaining contextual awareness over the entire input sequence. A larger context window permits AI models to intake more complex prompts and include more information from users when generating responses.
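As a rough sketch, a developer could estimate whether a document fits before sending it, again using tiktoken; the 4,000-token output budget and the file name are arbitrary assumptions.

```python
import tiktoken

CONTEXT_WINDOW = 128_000  # GPT-4o's context window, in tokens

def fits_in_context(document: str, output_budget: int = 4_000) -> bool:
    """Check whether a document plus a response budget fits in GPT-4o's window."""
    enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's tokenizer
    return len(enc.encode(document)) + output_budget <= CONTEXT_WINDOW

with open("report.txt", encoding="utf-8") as f:
    print(fits_in_context(f.read()))
```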
GPT-4 has already demonstrated a real-world ability to read documents via optical character recognition (OCR) by using the GPT-4 Vision API.
Sentiment analysis is the process of discerning how a person feels from their statements or behaviors. Researchers use AI models to perform sentiment analysis at scale—for example, when analyzing user reviews. As a multimodal model, GPT-4o can identify sentiment from audio input.
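The following is a minimal, text-only sketch of that kind of batch sentiment classification through the chat completions API; the reviews and the one-word labels are illustrative.

```python
from openai import OpenAI

client = OpenAI()

reviews = [
    "The checkout flow was fast and painless.",
    "Support never answered my ticket. Very frustrating.",
]

for review in reviews:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of the review as positive, negative or neutral. Reply with one word.",
            },
            {"role": "user", "content": review},
        ],
    )
    print(review, "->", response.choices[0].message.content)
```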
GPT-4o can process complex datasets and distill actionable insights, as seen with self-service analytics platforms. It can also represent data as charts and graphs.
GPT-4o isn’t the first LLM to have coding abilities, but its multimodal nature can simplify workflows for programmers. Rather than copy and paste code into the user interface, users can share their screens and allow GPT-4o to analyze their code, provide feedback and generate code snippets.
When revealing GPT-4o, OpenAI released LLM benchmarking test results comparing its new model to the competition. GPT-4 Turbo, the original GPT-4 release, Anthropic’s Claude 3 Opus, Meta’s Llama 3 400B and Google’s Gemini 1.5 Pro and Gemini 1.0 Ultra were tested alongside GPT-4o on several leading benchmarking tests.
Tests included Massive Multitask Language Understanding (MMLU), which assesses knowledge base and problem-solving skills, and HumanEval, a code-checking test. OpenAI did not test GPT-4o against Google’s models in Graduate-Level Google-Proof Q&A (GPQA), a science exam requiring extensive reasoning. Nor was Llama 3 400B included in Multilingual Grade School Math (MGSM), a math exam given in 10 languages.
GPT-4o placed first in four out of the six tests, coming second to Claude 3 Opus in MGSM and to GPT-4 Turbo in Discrete Reasoning Over Paragraphs (DROP), which tests a model’s ability to reason across several paragraphs.
Overall, GPT-4o did not demonstrate a significant boost in performance over GPT-4 Turbo. Its comparative advancements lie chiefly with its multimodal capabilities and increased speed.
OpenAI is making GPT-4o available for both free and premium users in various locations and products:
ChatGPT Plus, Team and Enterprise
ChatGPT Free
Desktop and mobile app
AI applications
Microsoft Azure OpenAI Studio
Subscribers to OpenAI’s premium ChatGPT services have varying levels of access to GPT-4o. ChatGPT Plus users can send up to 80 messages every 3 hours on GPT-4o, with Team users getting more access. Enterprise users have unlimited access to GPT-4o.
Users of the free ChatGPT tier can engage with GPT-4o as the default model behind the chatbot while demand allows. When a free user reaches their limit, they’ll be rolled back to GPT-4o mini3.
OpenAI released a macOS desktop app along with GPT-4o that integrates the new model with a user’s MacBook. OpenAI’s ChatGPT mobile apps also grant access to GPT-4o based on a user’s subscription tier.
ChatGPT Plus, Team and Enterprise users can build custom GPTs through which they and others can use specialized versions of GPT-4o tailored to specific use cases. Developers can also connect to GPT-4o through APIs to create other AI tools.
GPT-4o and GPT-4o mini are both available in Microsoft’s Azure OpenAI Studio, part of Microsoft’s Azure enterprise AI platform. As of publishing, Copilot continues to offer GPT-4 Turbo, though Microsoft announced in May 20244 that its AI service would soon receive GPT-4o support.
To mitigate potential use in vishing scams, deepfakes and other types of audio fraud, OpenAI limited GPT-4o to four preset voices at launch. But as with any generative AI release, GPT-4o is an imperfect model, and its use carries risks.
OpenAI classified GPT-4o as a medium-risk model on their internal scale. Models are evaluated on four threat metrics—cybersecurity, CBRN (chemical, biological, radiological and nuclear threats), persuasion and model autonomy. OpenAI assesses models according to the degree to which they can be used to advance developments in each threat field.
Once assessed, models are given a score ranging from Low to Critical in each field, then assigned an overall score corresponding to their single highest threat ranking after considering any mitigation measures.
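In other words, the overall rating is simply the highest of the per-category ratings after mitigations, as in this small sketch; the category scores shown mirror the GPT-4o assessment described below.

```python
LEVELS = ["Low", "Medium", "High", "Critical"]

def overall_rating(category_scores: dict[str, str]) -> str:
    # The overall score is the single highest post-mitigation category score
    return max(category_scores.values(), key=LEVELS.index)

gpt_4o_scores = {
    "Cybersecurity": "Low",
    "CBRN": "Low",
    "Persuasion": "Medium",
    "Model autonomy": "Low",
}
print(overall_rating(gpt_4o_scores))  # Medium
```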
GPT-4o earned a score of Low in three out of four categories, with a Medium score in Persuasion. This means it “can create (potentially interactive) content with comparable persuasive effectiveness to typical human-written content.”5 For example, malicious actors might theoretically use it to create disinformation articles and social media content.
1. Hello GPT-4o, OpenAI, 13 May 2024 (link resides outside IBM.com)
2. GPT-4o: Quality, Performance & Price Analysis, Artificial Analysis, 2024 (link resides outside IBM.com)
3. Using ChatGPT's Free Tier - FAQ, OpenAI, 2024 (link resides outside IBM.com)
4. Microsoft’s Copilot Embraces the Power of OpenAI's New GPT-4o, CNET, 24 May 2024 (link resides outside IBM.com)
5. Preparedness Framework (Beta), OpenAI, 18 December 2023 (link resides outside IBM.com)