IBM Granite 3.0: open, state-of-the-art enterprise models

Author

Kate Soule

Program Director, Data and Model Factory, IBM

Dave Bergmann

Writer, Data & AI

Today, we’re excited to release IBM Granite 3.0, the third generation of the Granite series of large language models (LLMs) and complementary tools. Reflecting our focus on the balance between powerful and practical, the new IBM Granite 3.0 models deliver state-of-the-art performance relative to model size while maximizing safety, speed and cost-efficiency for enterprise use cases.

Headlining the Granite 3.0 collection is a new, instruction-tuned, dense decoder-only LLM: Granite 3.0 8B Instruct. Trained using a novel two-phase method on over 12 trillion tokens of carefully vetted data across 12 different natural languages and 116 different programming languages, the developer-friendly Granite 3.0 8B Instruct is a workhorse enterprise model intended to serve as a primary building block for sophisticated workflows and tool-based use cases. Granite 3.0 8B Instruct matches leading similarly sized open models on academic benchmarks while outperforming those peers on benchmarks for enterprise tasks and safety.

Fine-tuning smaller, fit-for-purpose models like Granite enables enterprises to pursue frontier model performance at a fraction of the cost. Tailoring Granite models to your organization’s unique needs through InstructLab, a collaborative, open source approach to augmenting model knowledge and skills with systematically generated synthetic data and phased-training protocols, can reduce costs and timelines even further.

In keeping with IBM’s strong historical commitment to open source, all Granite models are released under the permissive Apache 2.0 license, bucking the recent trend of closed models or open weight models released under idiosyncratic proprietary licensing agreements. In another divergence from industry trends for open models, IBM is providing a detailed disclosure of training data sets and methodologies in the Granite 3.0 technical paper, reaffirming IBM’s dedication to building transparency, safety and trust in AI products.

In its entirety, the IBM Granite 3.0 release comprises:

Dense, general purpose LLMs: Granite-3.0-8B-Instruct, Granite-3.0-8B-Base, Granite-3.0-2B-Instruct and Granite-3.0-2B-Base.
LLM-based input-output guardrail models: Granite-Guardian-3.0-8B, Granite-Guardian-3.0-2B
Mixture of experts (MoE) models for minimum latency: Granite-3.0-3B-A800M-Instruct, Granite-3.0-1B-A400M-Instruct
Speculative decoder for increased inference speed and efficiency: Granite-3.0-8B-Instruct-Accelerator

Impending updates planned for the remainder of 2024 include an expansion of all model context windows to 128K tokens, further improvements in multilingual support for 12 natural languages and the introduction of multimodal image-in, text-out capabilities.

Granite 3.0 8B Instruct and Granite 3.0 2B Instruct, as well as both Guardian 3.0 safety models, are available today for commercial use on the IBM watsonx platform. Granite 3.0 models are also available through platform partners, including Google Vertex AI (through Google Cloud's Vertex AI Model Garden integrations with Hugging Face), Hugging Face, NVIDIA (as NIM microservices), Ollama and Replicate.

Powerful performance, safety and security

Previous generations of Granite models prioritized specialized use cases, excelling at domain-specific tasks across a diverse array of industries including finance, legal, code and academia. In addition to offering even greater efficacy in those arenas, IBM Granite 3.0 models match—and, in some cases, exceed—the general performance of leading open weight LLMs across both academic and enterprise benchmarks.

On academic benchmarks included in Hugging Face’s OpenLLM Leaderboard v2, Granite 3.0 8B Instruct rivals similarly sized models from Meta and Mistral AI. Code for IBM’s model evaluation methodology is disclosed in the accompanying technical paper and on the Granite GitHub repository.

IBM’s efforts to optimize Granite 3.0 8B Instruct for enterprise use cases are also readily apparent. For instance, Granite 3.0 8B Instruct led the evaluations on RAGBench, consisting of 100,000 retrieval augmented generation (RAG) tasks drawn from industry corpora such as user manuals. Models were compared across the 11 RAGBench datasets, evaluating for qualities like faithfulness (measuring the degree to which an output is supported by the retrieved documents) and correctness (measuring the extent to which the model’s output matches the factual content and semantic meaning of the ground truth for a given input).

The Granite 3.0 models were likewise trained to excel in key enterprise domains, such as cybersecurity: Granite 3.0 8B Instruct excels on both IBM’s proprietary cybersecurity benchmarks and prominent public security benchmarks.

Developers can use the new Granite 3.0 8B Instruct model for classic natural language use cases (text generation, classification, summarization, entity extraction and customer service chatbots), programming language use cases (code generation, code explanation and code editing) and agentic use cases requiring tool calling. When evaluated across 6 different tool calling benchmarks, including Berkeley’s Function Calling Leaderboard evaluation set, Granite 3.0 8B Instruct outperformed leading open models in its weight class.

In addition to exploring the updated collection of Granite recipes and how-to guides on GitHub, developers can easily experiment with the new Granite 3.0 8B Instruct model on the IBM Granite Playground.

Trust, safety, transparency and innovative training techniques

IBM believes that responsible AI is a competitive advantage, especially in the enterprise environment. The Granite series of generative AI models is developed in accordance with IBM’s principles for trust and transparency.

IBM Granite 3.0’s leading performance is therefore paired with an equal emphasis on model safety. Granite 3.0 8B Instruct demonstrates industry-leading robustness on the AttaQ benchmark, which measures an LLM’s vulnerability to adversarial prompts designed to provoke models into generating harmful, inappropriate or otherwise undesirable outputs.

The entire Granite family of models is trained on carefully curated enterprise datasets, filtered for objectionable content with critical concerns like governance, risk, privacy and bias mitigation in mind, as further detailed in the Responsible Use Guide for IBM Granite. In contrast to the industry trend of increasingly cryptic concealment of training data, IBM continues to disclose the Granite pretraining datasets. In a clear indication of its full confidence in the Granite series, IBM provides an uncapped indemnity for third-party IP claims against IBM-developed models.

During model building, the team performed an extensive array of experiments on data recipes across each model size. Thousands of experiments were performed on different data mixtures, alongside hundreds of small 1–2B parameter explorations, to perfect the final data recipes with the highest quality data possible.

This degree of experimentation was made possible by recent breakthroughs from IBM Research regarding optimal learning rates for pre-training LLMs. The learning rate determines the magnitude of updates to model parameters following backpropagation: a well-chosen learning rate ensures faster convergence to optimal model weights (and thereby more cost-effective training) while avoiding overfitting. The inflexibility of traditional learning rate schedulers, which require the number of training steps to be defined in advance, is problematic for large-scale models, for which it’s difficult to predict how many training tokens and update steps are ideal. The IBM Power scheduler adjusts the learning rate according to the number of training tokens and the batch size, using a power-law equation that models the complex relationship between training variables and hyperparameters.
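To make the idea of a power-law schedule concrete, here is a minimal sketch. It assumes the learning rate takes the form min(lr_max, batch_size * a * tokens_seen ** b); the constants a, b and lr_max below are illustrative placeholders, not values stated in this article, and the function name power_lr is hypothetical. The point it illustrates is that the rate depends only on tokens seen so far and batch size, so the total number of training steps never has to be fixed in advance.

```python
def power_lr(tokens_seen: int, batch_size: int,
             a: float = 4.0, b: float = -0.51, lr_max: float = 0.02) -> float:
    """Power-law learning rate (illustrative sketch, constants are assumptions).

    The rate scales up with batch size, decays as a power of the number of
    tokens already trained on, and is capped at lr_max early in training.
    """
    if tokens_seen <= 0:
        # No tokens seen yet: fall back to the cap.
        return lr_max
    return min(lr_max, batch_size * a * tokens_seen ** b)


if __name__ == "__main__":
    # The schedule can be queried at any point without knowing the total budget.
    for tokens in (10**6, 10**9, 10**12):
        print(tokens, power_lr(tokens, batch_size=128))
```

Because nothing in the function refers to a final step count, training can be extended or stopped early without redefining the schedule, which is the inflexibility of traditional schedulers described above.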

In training the Granite 3.0 language models, the team leveraged the Data Prep Kit, a framework and toolkit (developed and open sourced by IBM) for forming data processing pipelines for end-to-end processing of unstructured data. Specifically, the Data Prep Kit was used to scale data processing modules from a single laptop to a large cluster and provide lineage tracking, metadata logging and checkpoint capability for recovery from failures.

Granite Guardian: industry-leading safety guardrails

The third generation of IBM Granite also introduces a new family of LLM-based guardrail models, providing the most comprehensive set of risk and harm detection capabilities available in the market today. Granite Guardian 3.0 8B and Granite Guardian 3.0 2B can be used to monitor and manage inputs and outputs to any LLM, whether open or proprietary. Across extensive testing, the Granite Guardian models outperformed all three generations of Meta LlamaGuard while providing additional coverage for key hallucination checks not provided by the latter.
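The input-output guardrail pattern described here can be sketched in a few lines. This is an assumption-laden illustration, not the Granite Guardian API: guarded_generate, toy_detector and toy_llm are hypothetical names, and the keyword check merely stands in for a call to a real guardrail model.

```python
from typing import Callable

REFUSAL = "This request cannot be completed."


def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     is_risky: Callable[[str], bool],
                     refusal: str = REFUSAL) -> str:
    """Wrap any text generator with an input check and an output check."""
    if is_risky(prompt):        # screen the user input first
        return refusal
    response = generate(prompt)
    if is_risky(response):      # then screen the model output
        return refusal
    return response


def toy_detector(text: str) -> bool:
    # Stand-in for a guardrail model: flags text containing a blocked word.
    return "attack" in text.lower()


def toy_llm(prompt: str) -> str:
    # Stand-in for any LLM, open or proprietary.
    return "Echo: " + prompt


if __name__ == "__main__":
    print(guarded_generate("Summarize this manual", toy_llm, toy_detector))
    print(guarded_generate("Plan an attack", toy_llm, toy_detector))
```

Because the wrapper only needs two callables, the same check can sit in front of any generator, which mirrors the claim that the guardrail models work with any LLM.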

The new Granite Guardian models are variants of the correspondingly sized base pre-trained Granite models, fine-tuned to evaluate and classify model inputs and outputs into various categories of risk and harm dimensions including jailbreaking, bias, violence, profanity, sexual content and unethical behavior. In IBM testing, IBM Granite Guardian 3.0 8B demonstrated a 4-point increase over LlamaGuard 3 8B in average F1-score across common public risk detection benchmarks.
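For readers unfamiliar with the metric, the following sketch shows how an average F1-score across several benchmarks could be computed for a binary risk detector. The labels are made-up toy data for illustration only, not IBM's results, and the helper names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = harmonic mean of precision and recall for the positive (risky) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(benchmarks: Dict[str, Tuple[List[int], List[int]]]) -> float:
    """Unweighted mean of per-benchmark F1 scores."""
    scores = [f1_score(y_true, y_pred) for y_true, y_pred in benchmarks.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    toy = {
        "benchmark_a": ([1, 1, 0, 0], [1, 0, 1, 0]),  # toy labels, not real data
        "benchmark_b": ([1, 0, 1, 0], [1, 0, 1, 0]),
    }
    print(average_f1(toy))
```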

The Granite Guardian 3.0 models also cover a range of RAG-specific concerns. IBM testing demonstrated that IBM Granite Guardian 3.0 8B performs competitively with Bespoke-Minicheck-7B, the current state-of-the-art RAG fact-checking model, on benchmarks for detecting RAG hallucinations.

Speed and efficiency: mixture of experts (MoE) models and speculative decoding

The Granite 3.0 release also includes additional inference-efficient offerings: mixture of experts (MoE) models and a speculative decoder for accelerated inference.

IBM Granite’s first MoE models

Granite 3.0 3B-A800M and Granite 3.0 1B-A400M deliver high inference efficiency with a minimal tradeoff in performance. Trained on over 10 trillion tokens of data, the new Granite MoE models are ideal for deployment in on-device applications, CPU servers and situations requiring extremely low latency.

Their model names reference both their total parameter counts (3B and 1B, respectively) and their active parameter counts: the 3B MoE uses 800M parameters at inference, while the smaller 1B uses 400M parameters at inference. Granite 3.0 3B-A800M comprises 40 expert networks, whereas Granite 3.0 1B-A400M comprises 32 expert networks. Both models utilize top-8 routing.
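As a rough illustration of top-8 routing, the sketch below picks the 8 highest-scoring experts out of 40 for a single token and normalizes their scores into mixing weights. The softmax over only the selected logits is one common design choice and is an assumption here; the article does not describe how Granite's router normalizes its scores. The function name top_k_route is hypothetical.

```python
import math
import random
from typing import List, Tuple


def top_k_route(router_logits: List[float], k: int = 8) -> List[Tuple[int, float]]:
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Indices of the k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (assumed normalization).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    random.seed(0)
    logits = [random.gauss(0.0, 1.0) for _ in range(40)]  # one token, 40 experts
    for expert, weight in top_k_route(logits, k=8):
        print(expert, round(weight, 3))
```

Only the selected experts run for a given token, which is why the active parameter count at inference is much smaller than the total parameter count.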

The Granite 3.0 MoE models are both offered in base pre-trained and instruction tuned variants. Granite 3.0 3B-A800M Instruct is now available to download through Hugging Face, Ollama and NVIDIA. The smaller Granite 3.0 1B-A400M is available through Hugging Face and Ollama. The base pretrained Granite MoE models are presently available only on Hugging Face.

Speculative decoding for Granite 3.0 8B

Speculative decoding is an optimization technique for accelerating model inference speed, helping LLMs generate text faster while using the same (or less) compute resources and allowing more users to utilize a model at the same time. With the newly released Granite-3.0-8B-Instruct-Accelerator model, speculative decoding is used to achieve a 220% speedup in tokens per step.

In standard inferencing, LLMs process each previous token they’ve generated thus far, then generate one token at a time. In speculative decoding, LLMs also evaluate several prospective tokens that might come after the token they’re about to generate—if these “speculated” tokens are verified as sufficiently accurate, one pass can produce two or more tokens for the computational “price” of one. The technique was first introduced in consecutive 2023 papers from DeepMind and Google, using a tiny, separate “draft model” to do the speculative work. Earlier this year, a team of academic researchers released Medusa, an open source method that simply adds an additional layer to the base model.
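The draft-and-verify loop can be sketched with toy stand-in models. This is a simplified greedy variant under stated assumptions: the "models" are plain functions from a token list to the next token, a mismatch is resolved by taking the target model's token, and the verification loop stands in for what would be a single batched forward pass in a real system. All names (speculative_step, toy_target, toy_draft) are hypothetical, and this is not IBM's or Medusa's implementation.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(context: List[int], draft: NextToken,
                     target: NextToken, n_draft: int = 4) -> List[int]:
    """Propose n_draft tokens with the draft model, keep the verified prefix."""
    proposed: List[int] = []
    for _ in range(n_draft):
        proposed.append(draft(context + proposed))

    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)   # first mismatch: use the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))  # all accepted: one bonus token
    return accepted


def generate(context: List[int], draft: NextToken, target: NextToken,
             max_new: int, n_draft: int = 4) -> List[int]:
    out = list(context)
    while len(out) - len(context) < max_new:
        out.extend(speculative_step(out, draft, target, n_draft))
    return out[:len(context) + max_new]


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10           # "large" model: counts upward mod 10


def toy_draft(ctx: List[int]) -> int:
    # "Small" model: agrees with the target except after token 2.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, max_new=12))
```

In this greedy variant the output is identical to what the target model would produce on its own; the gain is that each step can emit several tokens whenever the draft's guesses are verified.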

IBM Research introduced a number of innovations to the Medusa method, chief among which was conditioning the speculated tokens on each other. For example, if “happy” is the first speculated token after “I am,” the model will speculatively predict what comes after “happy,” rather than continue to predict what follows “I am.” They also introduced a two-phase training method that leverages a form of knowledge distillation to jointly train the base model and speculator. This IBM breakthrough was used to cut the latency of Granite Code 20B in half while quadrupling its throughput.

The Granite 3.0 8B Instruct-Accelerator model is available through Hugging Face, released under the Apache 2.0 license.

Getting started with Granite 3.0 models

Granite 3.0 models are now available on IBM watsonx.ai and through platform partners such as Google Vertex AI (through Google Cloud's Vertex AI Model Garden integrations with Hugging Face), Hugging Face, NVIDIA (as NIM microservices), Ollama and Replicate.

A number of guides and recipes for working with Granite models are available in the Granite Snack Cookbook on GitHub, from orchestrating workflows using Granite language models in LangChain to implementing Granite Guardian models for hate, abuse and profanity (HAP) detection.

Developers can also get started with Granite models in the Granite model playground or by exploring the array of useful demos and tutorials in IBM docs.

IBM will continue to expand the third generation of Granite in the coming months, adding exciting new open models and capabilities to the Granite series.
