25 September, 2024
IBM is announcing the availability of multiple Llama 3.2 models on watsonx.ai, IBM’s enterprise studio for AI developers, following the launch of the Llama 3.2 collection of pretrained and instruction-tuned multilingual large language models (LLMs) at Meta Connect earlier today.
Most notably, Llama 3.2 marks Meta’s first foray into multimodal AI: the release includes two models, in sizes of 11B and 90B, that can take images as input. The instruction-tuned Llama 3.2 90B Vision and 11B Vision models are immediately available in watsonx.ai through SaaS.
Also arriving in watsonx.ai are the smallest Llama models to date: two text-in, text-out LLMs in sizes of 1B and 3B. All Llama 3.2 models support long context length (up to 128K tokens) and are optimized for fast and efficient inference with grouped query attention. Meta has specified that English, German, French, Italian, Portuguese, Hindi, Spanish and Thai are officially supported, but notes that Llama 3.2 was trained on—and developers may fine-tune Llama 3.2 models for—additional languages beyond those 8.
These latest additions from Meta join the expansive library of foundation models available in watsonx.ai, in keeping with IBM’s open, multi-model strategy toward generative AI.
“By making our latest Llama 3.2 models available on watsonx, a much wider pool of enterprises can benefit from these innovations and deploy our newest models hassle-free, on their own terms, and across hybrid cloud environments,” says Ahmad Al-Dahle, Head of GenAI, Meta. “At Meta we believe it’s essential to tailor AI solutions to each organization’s specific needs and empower them to scale Llama deployments with ease. Our partnership with IBM, an organization that shares in our commitment to openness, safety, security, trust and transparency, enables us to do just that.”
Read on for more details of the Llama 3.2 collection, including new multimodal capabilities, new deployment opportunities on mobile and other edge devices, updated safety features and more.
Unlike their text-only predecessors in the Llama series, Llama 3.2 11B Vision and Llama 3.2 90B Vision extend the collection’s capabilities to image-in, text-out use cases such as document-level understanding, interpretation of charts and graphs, and image captioning. Developers now have access to powerful visual reasoning models that approach the advanced capabilities of closed models while offering the full flexibility and customization of open models.
The new multimodal Llama 3.2 vision LLMs can reason over high-resolution images of up to 1120x1120 pixels, enabling their use for computer vision tasks including classification, object detection and identification, image-to-text transcription (including handwriting) through optical character recognition (OCR), contextual Q&A, data extraction and processing, image comparison and personal visual assistance.
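As a concrete illustration of these image-in, text-out capabilities, here is a minimal sketch of chart understanding with the 11B Vision model using the Hugging Face Transformers library, which added support for the Llama 3.2 vision architecture in version 4.45. It assumes approved access to the gated meta-llama checkpoint and a GPU with sufficient memory; the image file and prompt are placeholders.

```python
# Sketch: chart understanding with Llama 3.2 11B Vision Instruct via Transformers.
# Assumes transformers >= 4.45, a suitably large GPU, and approved access to the
# gated meta-llama/Llama-3.2-11B-Vision-Instruct repository on Hugging Face.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("quarterly_revenue_chart.png")  # placeholder image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the main trend shown in this chart."},
    ]}
]

# Build the chat prompt, bind the image to it, and generate a response.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same pattern applies to the other vision use cases listed above, such as document Q&A or OCR-style transcription, by changing the image and the text instruction.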
To add vision support, Meta pairs a separately trained image encoder with the existing Llama 3.1 text models through cross-attention adapter layers, leaving the language model weights untouched. A key benefit of this approach to multimodality is that the text capabilities of Llama 3.1 are fully preserved, allowing the vision models to serve as drop-in replacements for their text-only counterparts.
Both models perform well across common multimodal benchmarks for vision language models (VLMs), approaching top scores for open models and often rivaling state-of-the-art closed models. For example, the instruction-tuned Llama 3.2 90B Vision matches OpenAI’s GPT-4o on chart understanding (ChartQA) and beats both Anthropic’s Claude 3 Opus and Google’s Gemini 1.5 Pro on interpreting scientific diagrams (AI2D).1
Likewise, Llama 3.2 11B Vision achieved competitive benchmark scores for its weight class, beating Gemini 1.5 Flash 8B on document visual Q&A (DocVQA), topping Claude 3 Haiku and Claude 3 Sonnet on AI2D, ChartQA and visual mathematical reasoning (MathVista), and keeping pace with Pixtral 12B and Qwen2-VL 7B on general visual Q&A (VQAv2).2
The Llama 3.2 collection also includes variants with 1B and 3B parameters, representing the smallest Llama models yet.
Their small model size and correspondingly modest compute and memory requirements enable Llama to run locally on most hardware, including mobile and other edge devices. This gives Llama 3.2 1B and Llama 3.2 3B the potential to precipitate a wave of innovation in local applications and agentic AI. While there are many advantages to such compact, performant models, perhaps the two most important are:
Running locally, these lightweight Llama 3.2 models can serve as cost-effective agents to coordinate on-device applications, including retrieval-augmented generation (RAG), multilingual summarization and delegation of subtasks (a workflow sketched in the example below). They can also be used to lower the cost of implementing safety models like Llama Guard, a new multimodal version of which is also included in today’s release from Meta and available on watsonx.
Both of the new lightweight Llama models punch above their weight on performance benchmarks, particularly those related to key agentic AI tasks. For example, Llama 3.2 3B matches the larger Llama 3.1 8B on tool use (BFCL v2) and exceeds it on summarization (TLDR9+), with the 1B model likewise rivaling both on summarization and rewriting tasks. In a telling indication of how far open LLMs have come in a short amount of time, Llama 3.2 3B significantly outperformed the original GPT-4 on the MATH benchmark.
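As a sketch of the on-device role described above, the example below runs Llama 3.2 1B Instruct as a local summarizer through the Hugging Face Transformers pipeline API. The document text is a placeholder; in practice it might be supplied by a retrieval step or delegated by a larger agent. It assumes approved access to the gated checkpoint.

```python
# Sketch: local summarization with Llama 3.2 1B Instruct.
# Assumes transformers >= 4.45 and approved access to the gated checkpoint;
# the 1B model is small enough to run on a laptop-class GPU or even CPU.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

document = "..."  # placeholder: text retrieved by a local RAG step, an email, a report, etc.
messages = [
    {"role": "system", "content": "You are a concise assistant that summarizes documents."},
    {"role": "user", "content": f"Summarize the following text in three bullet points:\n\n{document}"},
]

# The pipeline accepts chat-style messages; the assistant reply is the final turn.
result = generator(messages, max_new_tokens=150)
print(result[0]["generated_text"][-1]["content"])
```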
With the ever-increasing availability of powerful AI models, building a competitive edge using off-the-shelf solutions alone will become increasingly difficult. Llama open models rival even the most powerful models on performance benchmarks while also offering customization, transparency and flexibility not possible with closed models.
Support for Llama 3.2 is part of IBM’s commitment to furthering open source innovation in AI and providing our clients with access to best-in-class open models in watsonx, including both third-party models and the IBM Granite model family.
IBM watsonx enables clients to fully customize their implementation of open-source models like Llama 3.2, offering flexibility over deployment environments along with intuitive workflows for fine-tuning, prompt engineering and integration with enterprise applications. Readily build custom AI applications for your business, manage all data sources and accelerate responsible AI workflows, all on one platform.
The following models are available today in IBM watsonx.ai:
- Llama 3.2 1B-Instruct
- Llama 3.2 3B-Instruct
- Llama 3.2 11B Vision-Instruct
- Llama 3.2 90B Vision-Instruct
- Llama Guard 3 11B Vision
They will be joined by the pre-trained Llama 3.2 models in the coming weeks. The “-Instruct” models have all undergone supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to better align them with common use cases and with human preferences for helpfulness and safety.
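For teams ready to experiment, here is a minimal sketch of calling one of these models as a managed service through the watsonx.ai Python SDK (ibm-watsonx-ai). The model ID, endpoint URL and generation parameters shown are illustrative assumptions; consult the watsonx.ai model catalog and SDK documentation for the exact identifiers and options available to your account.

```python
# Minimal sketch: calling a Llama 3.2 text model hosted on watsonx.ai (SaaS).
# Assumes the ibm-watsonx-ai SDK is installed (pip install ibm-watsonx-ai);
# the model ID, region URL and credentials below are placeholders.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",  # region endpoint (assumption)
    api_key="YOUR_IBM_CLOUD_API_KEY",
)

model = ModelInference(
    model_id="meta-llama/llama-3-2-3b-instruct",  # illustrative catalog ID
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
    params={"max_new_tokens": 200, "temperature": 0.2},
)

print(model.generate_text(prompt="Summarize the advantages of small language models in two sentences."))
```

Swapping in a different model ID from the list above is a one-line change.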
1 Cited benchmark evaluations for proprietary models are drawn from self-reported figures from Anthropic on 20 June, 2024 (for Claude 3.5 Sonnet and Claude 3 Opus) and 4 March, 2024 (for Claude 3 Sonnet and Haiku), from OpenAI on 13 May, 2024 (for GPT models) and from Google DeepMind in May and September 2024 (for Gemini models). AI2D testing for Gemini 1.5 Pro was reported by Anthropic.
2 Cited benchmark evaluations for Pixtral 12B and Qwen2-VL 7B are drawn from numbers reported by Mistral AI.