What is GPT-4o?
24 September 2024
Authors
Ivan Belcic Staff writer
Cole Stryker Editorial Lead, AI Models
What is GPT-4o?

GPT-4o is a multimodal and multilingual generative pretrained transformer model released in May 2024 by artificial intelligence (AI) developer OpenAI. It is the flagship large language model (LLM) in the GPT-4 family of AI models, which also includes GPT-4o mini, GPT-4 Turbo and the original GPT-4.

The “o” in GPT-4o stands for omni and signals that GPT-4o is a multimodal AI model with sound and vision capabilities: it can accept prompts that combine text, audio, image and video inputs, and it can also generate images. GPT-4o brings these multimedia input and output capabilities to the same transformer-powered GPT-4 intelligence that fuels the other models in its line.

Revealed in May 2024 as part of OpenAI’s Spring Update, ChatGPT’s new model was shown translating spoken language in real time, adapting its audio responses to convey emotional content and engaging in lifelike conversation.

Both GPT-4o and GPT-4o mini support fine-tuning, enabling developers to apply these models toward specific use cases. 
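As a rough illustration, fine-tuning is typically started through the OpenAI API by uploading a training file and creating a job. The minimal Python sketch below shows that flow; the file name, model snapshot and training data are assumptions for illustration, not details from this article.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a JSONL file of chat-formatted training examples (hypothetical file).
training_file = client.files.create(
    file=open("support_conversations.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job on a GPT-4o mini snapshot (model name is an assumption).
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)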

What is GPT-4o mini?

GPT-4o mini is a smaller, cost-effective GPT-4o model and the fastest generative AI model in the OpenAI product family. Unlike bigger LLMs with tens or hundreds of billions of parameters, GPT-4o mini is a small, lean model. Despite its compact size, it outperforms GPT-3.5 Turbo at similar speed and approximately 60% of the cost.

Like its larger sibling, GPT-4o mini has multimodal capabilities, supports languages other than English and can handle typical AI tasks such as reasoning, math and coding. As of publishing, GPT-4o mini can process text and image inputs. OpenAI claims that support for audio and video content is pending. 

Developers can access GPT-4o mini through application programming interfaces (APIs) at a cost of USD 0.15 per million input tokens and USD 0.60 per million output tokens.
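To make that pricing concrete, here is a hedged sketch of a GPT-4o mini call through the OpenAI Python SDK, with an approximate cost computed from the per-token rates above; the prompt and model alias are illustrative assumptions.

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the water cycle in two sentences."}],
)

usage = response.usage
# Published rates: USD 0.15 per million input tokens, USD 0.60 per million output tokens.
cost = (usage.prompt_tokens * 0.15 + usage.completion_tokens * 0.60) / 1_000_000
print(response.choices[0].message.content)
print(f"Approximate cost: USD {cost:.6f}")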

How is GPT-4o different from GPT-4 Turbo?

GPT-4o is an “all-in-one” flagship model capable of processing multimodal inputs and outputs on its own as a single neural network. With previous models such as GPT-4 Turbo and GPT-3.5, users would need OpenAI APIs and other supporting models to input and generate varied content types. While GPT-4 Turbo can process image prompts, it is not capable of processing audio without API assistance. 

The multimodal nature of GPT-4o is its single biggest breakthrough compared with GPT-4 Turbo, and it underpins many of the model’s advancements:

  • Real-time audio conversations 

  • Tone of voice capabilities 

  • Built-in video processing 

  • Image generation

  • Greater token efficiency

Real-time audio conversations

GPT-4o’s faster speed and multimodal capabilities allow it to engage conversationally and translate languages at a more humanlike pace than GPT-4 Turbo. In a video demo as part of its release announcement1, ChatGPT-4o was shown translating in real time between English and Spanish speakers. GPT-4o brings chatbot voice support in over 50 languages. 

Because it can process audio inputs by itself, GPT-4o has lower latency—the time taken to produce output from the moment an input is received—than previous models. It responds to audio input in an average of 320 milliseconds, comparable to the typical human conversational response time of 210 milliseconds.

Previous iterations of GPT-4 required multiple models assembled in a pipeline to provide a similar service, increasing its latency to 5.4 seconds. Before GPT-4o, OpenAI’s Whisper API converted audio prompts to text, fed them to GPT-4 Turbo, then text-to-speech (TTS) services converted GPT’s responses back to audio. 
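That earlier pipeline can be sketched roughly as three chained API calls; the file names, prompt wording and model choices below are assumptions for illustration, not OpenAI’s exact internal setup.

from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: Whisper transcribes the spoken prompt.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Text reasoning: GPT-4 Turbo answers the transcribed text.
answer = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text-to-speech: a TTS model reads the answer back as audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer.choices[0].message.content,
)
with open("reply.mp3", "wb") as out:
    out.write(speech.read())

Each hop in this chain adds latency, which is the overhead GPT-4o avoids by handling audio natively.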

GPT-4 Turbo was also a slower model in general. In benchmark tests by AI evaluation organization Artificial Analysis2, GPT-4o demonstrated a speed of 110 tokens per second, approximately three times faster than GPT-4 Turbo. In the same test, GPT-4o also outperformed many competitors, including Claude 3 Opus and Gemini 1.5 Pro. 

Tone of voice capabilities

The previous Whisper–GPT–TTS pipeline meant that when fielding audio inputs, GPT-4 Turbo was receiving only a text transcription of what was being said. The transcription isolates the content spoken by the user while filtering out background noise, speaker identities and tone of voice, depriving GPT of substantial contextual data. 

As a multimodal model, GPT-4o can process the entirety of an audio input and respond appropriately to this additional context. Its improved natural language processing (NLP) enables it to incorporate sentiment, tone and emotional content into its responses when used in voice mode.

Built-in video processing

GPT-4o handles image and video input in addition to text and audio. GPT-4o can connect to a live camera feed or record a user’s screen, then describe what it sees and answer questions. Users can turn on their smartphone cameras and speak with ChatGPT-4o the same way they would to a friend or colleague. 
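The developer-facing equivalent is sending individual frames as images. A hedged sketch follows, assuming a frame has already been captured from a camera or screen recording and saved to disk.

import base64
from openai import OpenAI

client = OpenAI()

# Hypothetical frame grabbed from a camera feed or screen recording.
with open("frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this frame?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)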

Previously, the GPT-4 Vision API allowed GPT-4 Turbo to analyze visual content.

Image generation

According to OpenAI’s release statement, GPT-4o offers self-contained image generation abilities. GPT-4 Turbo was only able to generate images by connecting to OpenAI’s DALL-E 3 model. GPT-4 Turbo would field a text-based user prompt, then DALL-E would create the image.
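At the time of writing, developers who needed images through the API typically still called DALL-E 3 directly, the route GPT-4 Turbo relied on. A minimal sketch, with the prompt and image size as assumptions:

from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at dawn",  # illustrative prompt
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image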

Greater token efficiency

OpenAI improved GPT-4’s tokenization abilities with GPT-4o. Tokenization is the process by which LLMs convert words into data. Each token represents a whole word, part of a word or a piece of punctuation. AI models convert words into tokens, then apply complex mathematics to analyze that data.

GPT-4o converts non-Roman languages, such as Chinese, Hindi and Arabic, into tokens much more efficiently than its predecessors. Because OpenAI charges API users per input or output token, GPT-4o’s improved efficiency in non-Roman scripts makes it more cost-effective for use cases in those languages.
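This difference can be checked with OpenAI’s open source tiktoken library, which exposes GPT-4o’s o200k_base encoding alongside the older cl100k_base encoding used by GPT-4 and GPT-4 Turbo; the sample sentence is an arbitrary assumption, and exact counts vary by text.

import tiktoken

text = "नमस्ते, आप कैसे हैं?"  # a short Hindi greeting

old = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4 Turbo tokenizer
new = tiktoken.get_encoding("o200k_base")   # GPT-4o tokenizer

print("GPT-4 Turbo tokens:", len(old.encode(text)))
print("GPT-4o tokens:", len(new.encode(text)))

Fewer tokens for the same text means fewer billed units per request.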

What else can GPT-4o do?

In addition to its new multimodal capabilities, GPT-4o offers many of the same functions seen in prior iterations: 

  • Question-answering 

  • Document analysis and summarization

  • Sentiment analysis 

  • Data analysis 

  • Coding

    Question-answering

    With a knowledge cutoff of October 2023, GPT-4o is OpenAI’s most current model in terms of its knowledge base. A knowledge cutoff is the point in time at which a model’s training data concludes. Users can ask GPT-4o questions and receive answers, though at the risk of hallucinations.

    Document analysis and summarization

    Users can upload files and have ChatGPT analyze and summarize them. GPT-4o’s context window of 128,000 tokens allows it to process large input datasets, though that is not quite as large as that of Claude 3.  

    The context window of an LLM represents the maximum number of tokens that it can field while maintaining contextual awareness over the entire input sequence. A larger context window permits AI models to intake more complex prompts and include more information from users when generating responses. 
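A rough way to check whether a document fits within GPT-4o's context window before sending it is to count its tokens locally with tiktoken; the file path and the headroom reserved for the model's reply are assumptions in this sketch.

import tiktoken

CONTEXT_WINDOW = 128_000   # GPT-4o's context window in tokens
RESPONSE_HEADROOM = 4_000  # tokens reserved for the reply (assumed value)

encoding = tiktoken.get_encoding("o200k_base")

with open("report.txt", encoding="utf-8") as f:
    document = f.read()

n_tokens = len(encoding.encode(document))
fits = n_tokens <= CONTEXT_WINDOW - RESPONSE_HEADROOM
print(f"{n_tokens} tokens; fits in a single request: {fits}")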

    GPT-4 has already demonstrated a real-world ability to read documents via optical character recognition (OCR) by using the GPT-4 Vision API.

    Sentiment analysis

    Sentiment analysis is the process of discerning how a person feels from their statements or behaviors. Researchers use AI models to perform sentiment analysis at scale—for example, when analyzing user reviews. As a multimodal model, GPT-4o can identify sentiment from audio input.
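A minimal sketch of batch sentiment analysis with GPT-4o mini follows; the reviews, label set and prompt wording are illustrative assumptions.

from openai import OpenAI

client = OpenAI()

reviews = [
    "The battery lasts all day, I love it.",
    "Arrived broken and support never answered.",
]

for review in reviews:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the review as positive, negative or neutral. Reply with one word."},
            {"role": "user", "content": review},
        ],
    )
    print(review, "->", response.choices[0].message.content.strip())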

    Data analysis

    GPT-4o can process complex datasets and distill actionable insights, as seen with self-service analytics platforms. It can also represent data as charts and graphs.

    Coding

    GPT-4o isn’t the first LLM to have coding abilities, but its multimodal nature can simplify workflows for programmers. Rather than copy and paste code into the user interface, users can share their screens and allow GPT-4o to analyze their code, provide feedback and generate code snippets.

    How does GPT-4o perform?

When revealing GPT-4o, OpenAI released LLM benchmarking results comparing its new model to the competition. GPT-4 Turbo, the original GPT-4, Anthropic’s Claude 3 Opus, Meta’s Llama 3 400B and Google’s Gemini 1.5 Pro and Gemini 1.0 Ultra were tested alongside GPT-4o on several leading benchmarks.

    Tests included Massive Multitask Language Understanding (MMLU), which assesses knowledge base and problem-solving skills, and HumanEval, a code-checking test. OpenAI did not test GPT-4o against Google’s models in Graduate-Level Google-Proof Q&A (GPQA), a science exam requiring extensive reasoning. Nor was Llama 3 400B included in Multilingual Grade School Math (MGSM), a math exam given in 10 languages. 

    GPT-4o placed first in four out of the six tests, coming second to Claude 3 Opus in MGSM and to GPT-4 Turbo in Discrete Reasoning Over Paragraphs (DROP), which tests a model’s ability to reason across several paragraphs. 

    Overall, GPT-4o did not demonstrate a significant boost in performance over GPT-4 Turbo. Its comparative advancements lie chiefly with its multimodal capabilities and increased speed.

    How can people access GPT-4o?

    OpenAI is making GPT-4o available for both free and premium users in various locations and products: 

    • ChatGPT Plus, Team and Enterprise

    • ChatGPT Free 

    • Desktop and mobile app 

    • AI applications 

    • Microsoft Azure OpenAI Studio

ChatGPT Plus, Team and Enterprise

    Subscribers to OpenAI’s premium ChatGPT services have varying levels of access to GPT-4o. ChatGPT Plus users can send up to 80 messages every 3 hours on GPT-4o, with Team users getting more access. Enterprise users have unlimited access to GPT-4o.

    ChatGPT Free

Users of the free ChatGPT tier can engage with GPT-4o as the default model behind the chatbot while demand allows. When free users reach their usage limit, they are switched to GPT-4o mini3.

    Desktop and mobile apps

OpenAI released a macOS desktop app alongside GPT-4o that integrates the new model with a user’s Mac. OpenAI’s ChatGPT mobile apps also grant access to GPT-4o based on a user’s subscription tier.

    AI applications

ChatGPT Plus, Team and Enterprise users can build custom GPTs through which they and others can use specialized versions of GPT-4o tailored to specific use cases. Developers can also connect to GPT-4o through APIs to create other AI tools.

    Microsoft Azure OpenAI Studio

    GPT-4o and GPT-4o mini are both available in Microsoft’s Azure OpenAI Studio, part of Microsoft’s Azure enterprise AI platform. As of publishing, Copilot continues to offer GPT-4 Turbo, though Microsoft announced in May 20244 that its AI service would soon receive GPT-4o support.
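Access through Azure differs mainly in authentication and in addressing the model by a deployment name rather than a raw model ID. A hedged sketch using the AzureOpenAI client from the same Python SDK, where the endpoint, key, API version and deployment name are placeholders:

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-AZURE-OPENAI-KEY",                          # placeholder
    api_version="2024-06-01",                                 # assumed version
)

response = client.chat.completions.create(
    model="my-gpt-4o-deployment",  # the Azure deployment name, not the model ID
    messages=[{"role": "user", "content": "Hello from Azure OpenAI."}],
)
print(response.choices[0].message.content)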

    What are the risks of using GPT-4o?

To mitigate potential use in vishing scams, deepfakes and other types of audio fraud, OpenAI limited GPT-4o to four preset voices at launch. But as with any generative AI release, GPT-4o is an imperfect model. The risks of its use include: 

• Hallucinations: Like all generative AI models, GPT-4o can hallucinate, perceiving patterns in its data that don't exist and presenting incorrect information as factual. It is never advisable to use AI-generated content without first vetting or fact-checking it. 
    • Data breaches: OpenAI continues to store user-provided inputs and train its models with them. Users who enter protected data into ChatGPT might find their data appearing in the model’s replies to other users. 
    • Intellectual property violations: OpenAI trains its models on data available online, including copyrighted material such as news articles. Models can inadvertently generate copyrighted content as part of a response. 

OpenAI classified GPT-4o as a medium-risk model on the internal scale defined in its Preparedness Framework. Models are evaluated across four threat categories: cybersecurity, CBRN (chemical, biological, radiological and nuclear threats), persuasion and model autonomy. OpenAI assesses each model according to the degree to which it could be used to advance developments in each threat field.

    Once assessed, models are given a score ranging from Low to Critical in each field, then assigned an overall score corresponding to their single highest threat ranking after considering any mitigation measures. 

    GPT-4o earned a score of Low in three out of four categories, with a Medium score in Persuasion. This means it “can create (potentially interactive) content with comparable persuasive effectiveness to typical human-written content.”5 For example, malicious actors might theoretically use it to create disinformation articles and social media content. 
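GPT-4o's overall rating follows directly from the rule above: the single highest category rating. A small illustrative sketch of that aggregation, using the ratings just described:

# Ordered from least to most severe, per the framework's scale.
LEVELS = ["Low", "Medium", "High", "Critical"]

def overall_score(category_scores: dict[str, str]) -> str:
    # The overall rating equals the single highest category rating.
    return max(category_scores.values(), key=LEVELS.index)

gpt_4o = {
    "Cybersecurity": "Low",
    "CBRN": "Low",
    "Persuasion": "Medium",
    "Model autonomy": "Low",
}
print(overall_score(gpt_4o))  # -> "Medium"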

    Footnotes

    1. Hello GPT-4o, OpenAI, 13 May 2024 (link resides outside IBM.com) 

    2. GPT-4o: Quality, Performance & Price Analysis, Artificial Analysis, 2024 (link resides outside IBM.com) 

    3. Using ChatGPT's Free Tier - FAQ, OpenAI, 2024 (link resides outside IBM.com) 

    4. Microsoft’s Copilot Embraces the Power of OpenAI's New GPT-4o, CNET, 24 May 2024 (link resides outside IBM.com) 

5. Preparedness Framework (Beta), OpenAI, 18 December 2023 (link resides outside IBM.com) 
