25 September, 2024
IBM is announcing the availability of multiple Llama 3.2 models on watsonx.ai, IBM’s enterprise studio for AI developers, following the launch of the Llama 3.2 collection of pretrained and instruction-tuned multilingual large language models (LLMs) at Meta Connect earlier today.
Most notably, Llama 3.2 marks Meta’s first foray into multimodal AI: the release includes two models, in sizes of 11B and 90B, that can take images as input. The instruction-tuned Llama 3.2 90B Vision and 11B Vision models are immediately available in watsonx.ai through SaaS.
Also arriving in watsonx.ai are the smallest Llama models to date: two text-in, text-out LLMs in sizes of 1B and 3B. All Llama 3.2 models support long context length (up to 128K tokens) and are optimized for fast and efficient inference with grouped query attention. Meta has specified that English, German, French, Italian, Portuguese, Hindi, Spanish and Thai are officially supported, but notes that Llama 3.2 was trained on—and developers may fine-tune Llama 3.2 models for—additional languages beyond those 8.
These latest additions from Meta join the expansive library of foundation models available in watsonx.ai, in keeping with IBM’s open, multi-model strategy toward generative AI.
“By making our latest Llama 3.2 models available on watsonx, a much wider pool of enterprises can benefit from these innovations and deploy our newest models hassle-free, on their own terms, and across hybrid cloud environments,” says Ahmad Al-Dahle, Head of GenAI, Meta. “At Meta we believe it’s essential to tailor AI solutions to each organization’s specific needs and empower them to scale Llama deployments with ease. Our partnership with IBM, an organization that shares in our commitment to openness, safety, security, trust and transparency, enables us to do just that.”
Read on for more details of the Llama 3.2 collection, including new multimodal capabilities, new deployment opportunities on mobile and other edge devices, updated safety features and more.
Unlike their text-only predecessors in the Llama series, Llama 3.2 11B Vision and Llama 3.2 90B Vision extend the collection’s capabilities to image-in, text-out use cases such as document-level understanding, interpretation of charts and graphs, and image captioning. Developers now have access to powerful visual reasoning models that approach the advanced capabilities of closed models while offering the full flexibility and customization of open models.
The new multimodal Llama 3.2 vision LLMs can reason over high-resolution images of up to 1120x1120 pixels, enabling their use for computer vision tasks including classification, object detection and identification, image-to-text transcription (including handwriting) through optical character recognition (OCR), contextual Q&A, data extraction and processing, image comparison and personal visual assistance.
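As a concrete illustration of these image-in, text-out capabilities, here is a minimal sketch of chart understanding with the 11B Vision model using the Hugging Face Transformers library, which added support for the Llama 3.2 vision architecture in version 4.45. It assumes approved access to the gated meta-llama checkpoint and a GPU with sufficient memory; the image file and prompt are placeholders.

```python
# Sketch: chart understanding with Llama 3.2 11B Vision Instruct via Transformers.
# Assumes transformers >= 4.45, a suitably large GPU, and approved access to the
# gated meta-llama/Llama-3.2-11B-Vision-Instruct repository on Hugging Face.
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("quarterly_revenue_chart.png")  # placeholder image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the main trend shown in this chart."},
    ]}
]

# Build the chat prompt, bind the image to it, and generate a response.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same pattern applies to the other vision use cases listed above, such as document Q&A or OCR-style transcription, by changing the image and the text instruction.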
To add vision support, Meta pairs a separately trained image encoder with the existing Llama 3.1 text models through cross-attention adapter layers, leaving the language model weights untouched. A key benefit of this approach to multimodality is that the text capabilities of Llama 3.1 are fully preserved, allowing the vision models to serve as drop-in replacements for their text-only counterparts.
Both models perform well across common multimodal benchmarks for vision language models (VLMs), approaching top scores for open models and often rivaling state-of-the-art closed models. For example, the instruction-tuned Llama 3.2 90B Vision matches OpenAI’s GPT-4o on chart understanding (ChartQA) and beats both Anthropic’s Claude 3 Opus and Google’s Gemini 1.5 Pro on interpreting scientific diagrams (AI2D).1
Likewise, Llama 3.2 11B Vision achieved competitive benchmark scores for its weight class, beating Gemini 1.5 Flash 8B on document visual Q&A (DocVQA), topping Claude 3 Haiku and Claude 3 Sonnet on AI2D, ChartQA and visual mathematical reasoning (MathVista), and keeping pace with Pixtral 12B and Qwen2-VL 7B on general visual Q&A (VQAv2).2
The Llama 3.2 collection also includes variants with 1B and 3B parameters, representing the smallest Llama models yet.
Their small model size and correspondingly modest compute and memory requirements enable Llama to run locally on most hardware, including mobile and other edge devices. This gives Llama 3.2 1B and Llama 3.2 3B the potential to precipitate a wave of innovation in local applications and agentic AI. While there are many advantages to such compact, performant models, perhaps the two most important are:
Running locally, these lightweight Llama 3.2 models can serve as cost-effective agents to coordinate on-device applications, including retrieval-augmented generation (RAG), multilingual summarization and delegation of subtasks (a workflow sketched in the example below). They can also be used to lower the cost of implementing safety models like Llama Guard, a new multimodal version of which is also included in today’s release from Meta and available on watsonx.
Both of the new lightweight Llama models punch above their weight on performance benchmarks, particularly those related to key agentic AI tasks. For example, Llama 3.2 3B matches the larger Llama 3.1 8B on tool use (BFCL v2) and exceeds it on summarization (TLDR9+), with the 1B model likewise rivaling both on summarization and rewriting tasks. In a telling indication of how far open LLMs have come in a short amount of time, Llama 3.2 3B significantly outperformed the original GPT-4 on the MATH benchmark.
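As a sketch of the on-device role described above, the example below runs Llama 3.2 1B Instruct as a local summarizer through the Hugging Face Transformers pipeline API. The document text is a placeholder; in practice it might be supplied by a retrieval step or delegated by a larger agent. It assumes approved access to the gated checkpoint.

```python
# Sketch: local summarization with Llama 3.2 1B Instruct.
# Assumes transformers >= 4.45 and approved access to the gated checkpoint;
# the 1B model is small enough to run on a laptop-class GPU or even CPU.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

document = "..."  # placeholder: text retrieved by a local RAG step, an email, a report, etc.
messages = [
    {"role": "system", "content": "You are a concise assistant that summarizes documents."},
    {"role": "user", "content": f"Summarize the following text in three bullet points:\n\n{document}"},
]

# The pipeline accepts chat-style messages; the assistant reply is the final turn.
result = generator(messages, max_new_tokens=150)
print(result[0]["generated_text"][-1]["content"])
```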
With the ever-increasing availability of powerful AI models, building a competitive edge using off-the-shelf solutions alone will become increasingly difficult. Llama open models rival even the most powerful models on performance benchmarks while also offering customization, transparency and flexibility not possible with closed models.
Support for Llama 3.2 is part of IBM’s commitment to furthering open source innovation in AI and providing our clients with access to best-in-class open models in watsonx, including both third-party models and the IBM Granite model family.
IBM watsonx enables clients to fully customize their implementation of open-source models like Llama 3.2, offering flexibility over deployment environments along with intuitive workflows for fine-tuning, prompt engineering and integration with enterprise applications. Readily build custom AI applications for your business, manage all data sources and accelerate responsible AI workflows, all on one platform.
The following models are available today in IBM watsonx.ai:
- Llama 3.2 1B-Instruct
- Llama 3.2 3B-Instruct
- Llama 3.2 11B Vision-Instruct
- Llama 3.2 90B Vision-Instruct
- Llama Guard 3 11B Vision
They will be joined by the pre-trained Llama 3.2 models in the coming weeks. The “-Instruct” models have all undergone supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to better align them with common use cases and with human preferences for helpfulness and safety.
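For teams ready to experiment, here is a minimal sketch of calling one of these models as a managed service through the watsonx.ai Python SDK (ibm-watsonx-ai). The model ID, endpoint URL and generation parameters shown are illustrative assumptions; consult the watsonx.ai model catalog and SDK documentation for the exact identifiers and options available to your account.

```python
# Minimal sketch: calling a Llama 3.2 text model hosted on watsonx.ai (SaaS).
# Assumes the ibm-watsonx-ai SDK is installed (pip install ibm-watsonx-ai);
# the model ID, region URL and credentials below are placeholders.
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",  # region endpoint (assumption)
    api_key="YOUR_IBM_CLOUD_API_KEY",
)

model = ModelInference(
    model_id="meta-llama/llama-3-2-3b-instruct",  # illustrative catalog ID
    credentials=credentials,
    project_id="YOUR_PROJECT_ID",
    params={"max_new_tokens": 200, "temperature": 0.2},
)

print(model.generate_text(prompt="Summarize the advantages of small language models in two sentences."))
```

Swapping in a different model ID from the list above is a one-line change.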
1 Cited benchmark evaluations for proprietary models are drawn from self-reported figures from Anthropic on 20 June, 2024 (for Claude 3.5 Sonnet and Claude 3 Opus) and 4 March, 2024 (for Claude 3 Sonnet and Haiku), from OpenAI on 13 May, 2024 (for GPT models) and from Google DeepMind in May and September 2024 (for Gemini models). AI2D testing for Gemini 1.5 Pro was reported by Anthropic.
2 Cited benchmark evaluations for Pixtral 12B and Qwen2-VL 7B are drawn from numbers reported by Mistral AI.