GPT-4o is a multimodal and multilingual generative pretrained transformer model released in May 2024 by artificial intelligence (AI) developer OpenAI. It is the flagship large language model (LLM) in the GPT-4 family of AI models, which also includes GPT-4o mini, GPT-4 Turbo and the original GPT-4.
The “o” in GPT-4o stands for omni and highlights that GPT-4o is a multimodal AI model with sound and vision capabilities. It can accept prompts that mix text, audio, image and video input, and it is also capable of image generation. GPT-4o brings multimedia input and output capabilities to the same transformer-powered GPT-4 intelligence that fuels the other models in its line.
Revealed in May 2024 as part of OpenAI’s Spring Updates, ChatGPT’s new model was demonstrated translating spoken language in real time, adapting audio responses to include emotional content and engaging in lifelike conversations.
Both GPT-4o and GPT-4o mini support fine-tuning, enabling developers to apply these models toward specific use cases.
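As an illustration, a fine-tuning job might be started through OpenAI’s Python SDK roughly as in the sketch below; the training file name and the dated model snapshot are placeholders, and the exact identifiers depend on which snapshots OpenAI currently exposes for fine-tuning.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Upload a JSONL file of chat-formatted training examples (placeholder file name)
training_file = client.files.create(
    file=open("training_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job against a GPT-4o mini snapshot (illustrative identifier)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```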
GPT-4o mini is a smaller, cost-effective GPT-4o model and the fastest generative AI model in the OpenAI product family. Unlike larger LLMs with tens or hundreds of billions of parameters, GPT-4o mini is a small, lean model. Despite its compact size, it outperforms GPT-3.5 Turbo at similar speed and approximately 60% of the cost.
Like its larger sibling, GPT-4o mini has multimodal capabilities, supports languages other than English and can handle typical AI tasks such as reasoning, math and coding. As of publishing, GPT-4o mini can process text and image inputs. OpenAI claims that support for audio and video content is pending.
Developers can access GPT-4o mini through application programming interfaces (APIs) at a cost of USD 0.15 per million input tokens and USD 0.60 per million output tokens.
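A minimal sketch of calling GPT-4o mini through the official Python SDK and estimating the per-request cost from the token counts the API returns; the prices below simply mirror the figures above and may change.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain tokenization in one sentence."}],
)
print(response.choices[0].message.content)

# Estimate the cost of this single request from the usage the API reports
input_cost = response.usage.prompt_tokens * 0.15 / 1_000_000
output_cost = response.usage.completion_tokens * 0.60 / 1_000_000
print(f"Approximate cost: ${input_cost + output_cost:.6f}")
```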
GPT-4o is an “all-in-one” flagship model capable of processing multimodal inputs and outputs on its own as a single neural network. With previous models such as GPT-4 Turbo and GPT-3.5, users would need OpenAI APIs and other supporting models to input and generate varied content types. While GPT-4 Turbo can process image prompts, it is not capable of processing audio without API assistance.
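For example, a single chat completion request to GPT-4o can mix text and image content without any auxiliary model. This sketch assumes the OpenAI Python SDK and uses a placeholder image URL.

```python
from openai import OpenAI

client = OpenAI()

# One request, one model: text and an image in the same prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```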
The multimodal nature of GPT-4o is its single biggest breakthrough compared with GPT-4 Turbo and underpins many of its advancements:
Real-time audio conversations
Tone of voice capabilities
Built-in video processing
Image generation
Greater token efficiency
GPT-4o’s faster speed and multimodal capabilities allow it to engage conversationally and translate languages at a more humanlike pace than GPT-4 Turbo. In a video demo that was part of its release announcement1, GPT-4o was shown translating in real time between English and Spanish speakers. GPT-4o brings chatbot voice support in over 50 languages.
Because it can process audio inputs by itself, GPT-4o has lower latency—the time taken to produce output from the moment an input is received—than previous models. It responds to audio input in 320 milliseconds, comparable to the typical human response time of 210 milliseconds.
Previous iterations of GPT-4 required multiple models assembled in a pipeline to provide a similar service, which increased latency to 5.4 seconds. Before GPT-4o, OpenAI’s Whisper API converted audio prompts to text and fed them to GPT-4 Turbo, then text-to-speech (TTS) services converted GPT’s responses back to audio.
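In rough terms, that earlier pipeline looked something like the following sketch, which chains three separate API calls; the file names and the voice selection are placeholders.

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech to text: Whisper transcribes the user's audio prompt
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("question.mp3", "rb"),
)

# 2. Reasoning: only the text transcription is passed to GPT-4 Turbo
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text to speech: a TTS model converts the reply back to audio
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("answer.mp3", "wb") as f:
    f.write(speech.content)
```

Each hop in that chain adds latency, which is why collapsing the pipeline into a single model matters.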
GPT-4 Turbo was also a slower model in general. In benchmark tests by AI evaluation organization Artificial Analysis2, GPT-4o demonstrated a speed of 110 tokens per second, approximately three times faster than GPT-4 Turbo. In the same test, GPT-4o also outperformed many competitors, including Claude 3 Opus and Gemini 1.5 Pro.
The previous Whisper–GPT–TTS pipeline meant that when fielding audio inputs, GPT-4 Turbo was receiving only a text transcription of what was being said. The transcription isolates the content spoken by the user while filtering out background noise, speaker identities and tone of voice, depriving GPT of substantial contextual data.
As a multimodal model, GPT-4o can process the entirety of an audio input and respond appropriately to this additional context. Its improved natural language processing (NLP) enables it to incorporate sentiment, tone and emotional content into its output when used in voice mode.
GPT-4o handles image and video input in addition to text and audio. GPT-4o can connect to a live camera feed or record a user’s screen, then describe what it sees and answer questions. Users can turn on their smartphone cameras and speak with ChatGPT-4o the same way they would to a friend or colleague.
Previously, the GPT-4 Vision API allowed GPT-4 Turbo to analyze visual content.
According to OpenAI’s release statement, GPT-4o offers self-contained image generation abilities. GPT-4 Turbo was only able to generate images by connecting to OpenAI’s DALL-E 3 model. GPT-4 Turbo would field a text-based user prompt, then DALL-E would create the image.
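That older two-step flow resembled the following sketch, in which the image request is handed off to the DALL-E 3 endpoint rather than handled by the language model itself; the prompt is illustrative.

```python
from openai import OpenAI

client = OpenAI()

# GPT-4 Turbo era: image generation required a separate call to DALL-E 3
image = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at dawn",
    size="1024x1024",
    n=1,
)
print(image.data[0].url)  # URL of the generated image
```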
OpenAI improved GPT-4’s tokenization abilities with GPT-4o. Tokenization is the process by which LLMs convert words into data. Each token represents a whole word, part of a word or a piece of punctuation. AI models convert words into tokens, then apply complex mathematics to analyze that data.
GPT-4o converts non-Roman languages, such as Chinese, Hindi and Arabic, into tokens much more efficiently than its predecessors. Because OpenAI charges API users per input or output token, GPT-4o’s improved efficiency in non-Roman scripts makes it more cost-effective for use cases in those languages.
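The difference can be seen with OpenAI’s open source tiktoken library: GPT-4o uses the newer o200k_base encoding, while GPT-4 and GPT-4 Turbo use cl100k_base. The Hindi sentence below is only an illustration; exact token counts vary by text.

```python
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # used by GPT-4 and GPT-4 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # used by GPT-4o and GPT-4o mini

text = "नमस्ते, आप कैसे हैं?"  # Hindi: "Hello, how are you?"

print("GPT-4 Turbo tokens:", len(old_enc.encode(text)))
print("GPT-4o tokens:", len(new_enc.encode(text)))
```

Fewer tokens for the same text translates directly into lower API charges and more room in the context window.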
In addition to its new multimodal capabilities, GPT-4o brings many of the same functions as seen in prior iterations:
Question-answering
Document analysis and summarization
Sentiment analysis
Data analysis
Coding
With a knowledge cutoff of October 2023, GPT-4o is OpenAI’s most current model in terms of its knowledge base. A knowledge cutoff is the point in time at which a model’s training data concludes. Users can ask GPT-4o questions and receive answers, though at the risk of hallucinations.
Users can upload files and have ChatGPT analyze and summarize them. GPT-4o’s context window of 128,000 tokens allows it to process large input datasets, though that window is not quite as large as Claude 3’s.
The context window of an LLM represents the maximum number of tokens that it can field while maintaining contextual awareness over the entire input sequence. A larger context window permits AI models to intake more complex prompts and include more information from users when generating responses.
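As a rough sketch, a developer could estimate whether a document fits before sending it, again using tiktoken; the 4,000-token output budget and the file name are arbitrary assumptions.

```python
import tiktoken

CONTEXT_WINDOW = 128_000  # GPT-4o's context window, in tokens

def fits_in_context(document: str, output_budget: int = 4_000) -> bool:
    """Check whether a document plus a response budget fits in GPT-4o's window."""
    enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's tokenizer
    return len(enc.encode(document)) + output_budget <= CONTEXT_WINDOW

with open("report.txt", encoding="utf-8") as f:
    print(fits_in_context(f.read()))
```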
GPT-4 has already demonstrated a real-world ability to read documents via optical character recognition (OCR) by using the GPT-4 Vision API.
Sentiment analysis is the process of discerning how a person feels from their statements or behaviors. Researchers use AI models to perform sentiment analysis at scale—for example, when analyzing user reviews. As a multimodal model, GPT-4o can identify sentiment from audio input.
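The following is a minimal, text-only sketch of that kind of batch sentiment classification through the chat completions API; the reviews and the one-word labels are illustrative.

```python
from openai import OpenAI

client = OpenAI()

reviews = [
    "The checkout flow was fast and painless.",
    "Support never answered my ticket. Very frustrating.",
]

for review in reviews:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of the review as positive, negative or neutral. Reply with one word.",
            },
            {"role": "user", "content": review},
        ],
    )
    print(review, "->", response.choices[0].message.content)
```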
GPT-4o can process complex datasets and distill actionable insights, as seen with self-service analytics platforms. It can also represent data as charts and graphs.
GPT-4o isn’t the first LLM to have coding abilities, but its multimodal nature can simplify workflows for programmers. Rather than copy and paste code into the user interface, users can share their screens and allow GPT-4o to analyze their code, provide feedback and generate code snippets.
When revealing GPT-4o, OpenAI released LLM benchmarking test results comparing its new model to the competition. GPT-4 Turbo, the original GPT-4 release, Anthropic’s Claude 3 Opus, Meta’s Llama 3 400B and Google’s Gemini 1.5 Pro and Gemini 1.0 Ultra were tested alongside GPT-4o on several leading benchmarking tests.
Tests included Massive Multitask Language Understanding (MMLU), which assesses knowledge base and problem-solving skills, and HumanEval, a code-checking test. OpenAI did not test GPT-4o against Google’s models in Graduate-Level Google-Proof Q&A (GPQA), a science exam requiring extensive reasoning. Nor was Llama 3 400B included in Multilingual Grade School Math (MGSM), a math exam given in 10 languages.
GPT-4o placed first in four out of the six tests, coming second to Claude 3 Opus in MGSM and to GPT-4 Turbo in Discrete Reasoning Over Paragraphs (DROP), which tests a model’s ability to reason across several paragraphs.
Overall, GPT-4o did not demonstrate a significant boost in performance over GPT-4 Turbo. Its comparative advancements lie chiefly with its multimodal capabilities and increased speed.
OpenAI is making GPT-4o available for both free and premium users in various locations and products:
ChatGPT Plus, Team and Enterprise
ChatGPT Free
Desktop and mobile app
AI applications
Microsoft Azure OpenAI Studio
Subscribers to OpenAI’s premium ChatGPT services have varying levels of access to GPT-4o. ChatGPT Plus users can send up to 80 messages every 3 hours on GPT-4o, with Team users getting more access. Enterprise users have unlimited access to GPT-4o.
Users of the free ChatGPT tier can engage with GPT-4o as the default model behind the chatbot while demand allows. When a free user reaches their limit, they’ll be rolled back to GPT-4o mini3.
OpenAI released a macOS desktop app along with GPT-4o that integrates the new model with a user’s MacBook. OpenAI’s ChatGPT mobile apps also grant access to GPT-4o based on a user’s subscription tier.
ChatGPT Plus, Team and Enterprise users can build custom GPTs through which they and others can use specialized versions of GPT-4o tailored to specific use cases. Developers can also connect to GPT-4o through APIs to create other AI tools.
GPT-4o and GPT-4o mini are both available in Microsoft’s Azure OpenAI Studio, part of Microsoft’s Azure enterprise AI platform. As of publishing, Copilot continues to offer GPT-4 Turbo, though Microsoft announced in May 20244 that its AI service would soon receive GPT-4o support.
To mitigate potential use in vishing scams, deepfakes and other types of audio fraud, OpenAI limited GPT-4o to four preset voices at launch. But as with any generative AI release, GPT-4o is an imperfect model, and its use carries risks.
OpenAI classified GPT-4o as a medium-risk model on their internal scale. Models are evaluated on four threat metrics—cybersecurity, CBRN (chemical, biological, radiological and nuclear threats), persuasion and model autonomy. OpenAI assesses models according to the degree to which they can be used to advance developments in each threat field.
Once assessed, models are given a score ranging from Low to Critical in each field, then assigned an overall score corresponding to their single highest threat ranking after considering any mitigation measures.
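In other words, the overall rating is simply the highest of the per-category ratings after mitigations, as in this small sketch; the category scores shown mirror the GPT-4o assessment described below.

```python
LEVELS = ["Low", "Medium", "High", "Critical"]

def overall_rating(category_scores: dict[str, str]) -> str:
    # The overall score is the single highest post-mitigation category score
    return max(category_scores.values(), key=LEVELS.index)

gpt_4o_scores = {
    "Cybersecurity": "Low",
    "CBRN": "Low",
    "Persuasion": "Medium",
    "Model autonomy": "Low",
}
print(overall_rating(gpt_4o_scores))  # Medium
```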
GPT-4o earned a score of Low in three out of four categories, with a Medium score in Persuasion. This means it “can create (potentially interactive) content with comparable persuasive effectiveness to typical human-written content.”5 For example, malicious actors might theoretically use it to create disinformation articles and social media content.
1. Hello GPT-4o, OpenAI, 13 May 2024 (link resides outside IBM.com)
2. GPT-4o: Quality, Performance & Price Analysis, Artificial Analysis, 2024 (link resides outside IBM.com)
3. Using ChatGPT's Free Tier - FAQ, OpenAI, 2024 (link resides outside IBM.com)
4. Microsoft’s Copilot Embraces the Power of OpenAI's New GPT-4o, CNET, 24 May 2024 (link resides outside IBM.com)
5. Preparedness Framework (Beta), OpenAI, 18 December 2023 (link resides outside IBM.com)