
Published: 19 December 2023
Contributors: Dave Bergmann

What is Llama 2?

Llama 2 is a family of pre-trained and fine-tuned large language models (LLMs) released by Meta AI in 2023. Released free of charge for research and commercial use, Llama 2 AI models are capable of a variety of natural language processing (NLP) tasks, from text generation to programming code.

The Llama 2 model family, offered as both base foundation models and fine-tuned “chat” models, serves as the successor to the original LLaMa 1 models, which were released in 2023 under a noncommercial license granting access on a case-by-case basis exclusively to research institutions. Unlike their predecessors, Llama 2 models are available free of charge for both AI research and commercial use.

Meta’s Llama models thus aim to play an important role in democratizing the generative AI ecosystem. As noted in the Llama 2 research paper (link resides outside ibm.com), though the methodology for pre-training autoregressive LLMs via self-supervised learning is, by now, relatively straightforward and well understood, the immense computational requirements the process entails have largely limited the development of cutting-edge LLMs to a few key players. Because most state-of-the-art LLMs, like OpenAI’s GPT, Anthropic’s Claude and Google’s Bard, are massive proprietary closed-source models, the kind of public research access that might help explain how and why these models work—and how to better align their development with human interests—has been markedly limited.

In addition to making its code and model weights freely available, the Llama project has focused on advancing the performance of smaller models, rather than on increasing parameter count. Whereas most prominent closed-source models have hundreds of billions of parameters, Llama 2 models are offered with seven billion (7B), 13 billion (13B) or 70 billion (70B) parameters.

This enables smaller organizations, like startups and members of the research community, to deploy local instances of Llama 2 models—or Llama-based models developed by the AI community—without needing prohibitively expensive computing time or infrastructure investments.

Learn more: IBM to Make Llama 2 Available within its Watsonx AI and Data Platform
Llama 2 vs. LLaMa 1

The Llama 2 research paper details several advantages the newer generation of AI models offers over the original LLaMa models.

  • Greater context length: Llama 2 models offer a context length of 4,096 tokens, double that of LLaMa 1. The context length (or context window) refers to the maximum number of tokens the model can “remember” during inferencing (that is, during the generation of text or an ongoing conversation). A longer window allows for greater complexity and a more coherent, fluent exchange of natural language; a sketch of checking a prompt against this limit follows this list.
  • Greater accessibility: Whereas LLaMa 1 was released exclusively for research use, Llama 2 is available to any organization with fewer than 700 million monthly active users.
  • More robust training: Llama 2 was pre-trained on 40% more data, increasing its knowledge base and contextual understanding. Furthermore, unlike LLaMa 1, Llama 2 chat models were fine-tuned using reinforcement learning from human feedback (RLHF), helping better align model responses with human expectations.
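
In practical terms, the context window determines how much text the model can attend to at once; prompts longer than the limit must be truncated or summarized. Below is a minimal sketch of checking a prompt against Llama 2’s 4,096-token limit, assuming the Hugging Face transformers library and approved access to the gated meta-llama/Llama-2-7b-hf repository:

```python
# A minimal sketch: count tokens in a prompt and compare against Llama 2's
# 4,096-token context window. Assumes the Hugging Face `transformers`
# library and approved access to the gated meta-llama/Llama-2-7b-hf repo.
from transformers import AutoTokenizer

LLAMA_2_CONTEXT_LENGTH = 4096  # double LLaMa 1's 2,048 tokens

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

prompt = "Summarize the plot of Moby-Dick in three sentences."
token_ids = tokenizer.encode(prompt)

print(f"Prompt uses {len(token_ids)} of {LLAMA_2_CONTEXT_LENGTH} tokens")
if len(token_ids) > LLAMA_2_CONTEXT_LENGTH:
    print("Prompt exceeds the context window and would be truncated.")
```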

Is Llama 2 open source?

Though Meta has made the starting code and model weights for Llama 2 models freely available for research and commercial use, certain restrictions in its licensing agreement have caused debate regarding whether it can properly be called “open source.”

The debate is somewhat technical and semantic: though “open source” is often used colloquially to refer to any software (or other programming tools) whose source code is distributed free of charge, it is actually a formal designation stewarded by the Open Source Initiative (OSI). The OSI only certifies a given software license as “Open Source Initiative approved” if it deems said license to meet the ten requirements listed in the official Open Source Definition (OSD) (link resides outside ibm.com).

As explained in a statement from OSI Executive Director Stefano Maffulli, “OSI is pleased to see that Meta is lowering barriers for access to powerful AI systems. Unfortunately, the tech giant has created the misunderstanding that LLaMa 2 is ‘open source’ – it is not.” 1

The discrepancy stems from two aspects of the Llama 2 license agreement:

  • Any organization with greater than 700 million monthly active users must request a license from Meta (to be granted at Meta’s sole discretion). 2
  • The “Acceptable Use Policy” prohibits use of the models for violence, criminal activity and impersonating humans, among other legal and moral restrictions.

These restrictions contradict two points of the OSD:

  • Point 5: “The license must not discriminate against any person or group of persons.” 3
  • Point 6: “The license must not restrict anyone from making use of the program in a specific field of endeavor.” 3

To acknowledge both the open spirit of Llama 2 and its failure to meet the technical definition of “open source,” some in the tech community have used the term “open approach.” 4

How does Llama 2 work?

Llama 2 base models are pre-trained foundation models meant to be fine-tuned for specific use cases, whereas Llama 2 chat models are already optimized for dialogue.

Llama 2 base models

Llama 2 is a family of transformer-based autoregressive causal language models. Autoregressive language models take a sequence of words as input and iteratively predict the next word, feeding each output back into the model as input for the following prediction.

During self-supervised pre-training, LLMs are provided the beginning of sample sentences drawn from a massive corpus of unlabeled data and tasked with predicting the next word. In training the model to minimize divergence between ground truth (the actual next word) and its own predictions, the model learns to replicate linguistic and logical patterns in the training data. Though the research paper notably omits details on specific data sources, it states that Llama 2 was trained with 2 trillion tokens—the numerically represented words, word parts, phrases and other semantic fragments that transformer-based neural networks use for language processing—from publicly available sources.
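
Concretely, “minimizing divergence between ground truth and prediction” is typically implemented as a cross-entropy loss over shifted token sequences. Below is a minimal sketch of that next-token objective in PyTorch; the batch dimensions and random logits are illustrative stand-ins, not details from the Llama 2 paper:

```python
# A minimal sketch of the self-supervised next-token objective described
# above: shift the sequence by one position so each token's "label" is
# the token that actually follows it, then minimize cross-entropy.
# The random logits stand in for a real model's output.
import torch
import torch.nn.functional as F

vocab_size = 32_000  # Llama 2's tokenizer uses a 32k-token vocabulary
batch, seq_len = 4, 128

# token_ids: a batch of sequences sampled from the unlabeled corpus
token_ids = torch.randint(0, vocab_size, (batch, seq_len))

# logits: one distribution over the vocabulary per position
logits = torch.randn(batch, seq_len, vocab_size)

# Predict token t+1 from position t: drop the last logit, drop the
# first token, and align the two.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = token_ids[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels)
print(f"next-token prediction loss: {loss.item():.4f}")
```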

On a fundamental level, base foundation models are not pre-trained to actually answer a prompt: they append text to it in a grammatically coherent way. An out-of-the-box foundation model might respond to a prompt of “teach me to bake cookies” with “for a holiday party.” Further fine-tuning, via techniques like supervised learning and reinforcement learning, is required to train a foundation model for a specific application like dialogue, instruction following or creative writing.
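
This completion behavior can be observed directly. Below is a minimal sketch using the Hugging Face transformers pipeline, assuming approved access to the gated meta-llama/Llama-2-7b-hf base checkpoint and hardware able to host a 7B-parameter model:

```python
# A minimal sketch contrasting base-model behavior with chat-model
# behavior. Assumes `transformers` (with accelerate) is installed and the
# gated meta-llama repository is approved for your Hugging Face account.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-hf",  # base model: completes text
    device_map="auto",
)

result = generator("Teach me to bake cookies", max_new_tokens=20)
print(result[0]["generated_text"])
# A base model may simply extend the sentence, e.g.
# "Teach me to bake cookies for a holiday party..."
# rather than answer with actual baking instructions.
```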

Base Llama 2 models are instead intended to serve as a foundation upon which to build a purpose-specific model. To date, Llama 2 (and the original LLaMa) models have served as the base of several prominent open source LLMs, including:

  • Alpaca: A version of LLaMa 7B fine-tuned for instruction-following by Stanford University researchers. Notably, it achieved results competitive with GPT-3.5 despite costing only USD 600 in compute resources.5
  • Vicuna: A chat assistant model from LMSYS Org, trained by fine-tuning LLaMa 13B on user conversations from ShareGPT (link resides outside ibm.com). It outperformed Alpaca in over 90% of cases at a mere USD 300 training cost.6
  • Orca: A fine-tuned version of Llama 2, trained by Microsoft using a “teacher-student” scheme, in which a larger, more powerful LLM is used to generate examples of useful reasoning behavior for the smaller model to follow.7
  • WizardLM: Fine-tuned using Evol-Instruct, a method to create large amounts of synthetic instruction data using LLMs, WizardLM achieved over 90% of the performance of ChatGPT in 17 of 29 skills evaluated.8
Llama 2 chat models

Llama-2-chat models are fine-tuned for dialogue-driven use cases, similar to the specific GPT model versions used in ChatGPT.

Supervised fine-tuning (SFT) was used to prime the pre-trained Llama 2 base model to generate responses in the format expected by users in a chatbot or virtual agent setting. In a series of supervised learning tasks, labeled pairs of dialogue-style exchanges, annotated as (prompt, response), are used to train the model to minimize the divergence between its own response for a given prompt and the example response provided by the labeled data. The model thus learns, for example, that the proper response to a prompt of “teach me to bake cookies” is to provide actual instructions to bake cookies, rather than merely complete the sentence.
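
A common way to implement this objective (a general sketch of standard practice, not a detail confirmed by the Llama 2 paper) is to concatenate each prompt and response into one token sequence and compute the loss only over the response tokens, so the model learns to produce answers rather than to reproduce prompts:

```python
# A minimal SFT sketch: concatenate prompt and response tokens, then mask
# the prompt positions with -100 so cross-entropy is computed only over
# the annotated response. Token IDs are toy values for illustration.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions cross_entropy should skip

prompt_ids = torch.tensor([101, 2425, 2033, 2000])     # "teach me to..."
response_ids = torch.tensor([2, 5531, 1024, 2509, 3])  # actual instructions

input_ids = torch.cat([prompt_ids, response_ids])
labels = torch.cat([
    torch.full_like(prompt_ids, IGNORE_INDEX),  # no loss on the prompt
    response_ids,                               # loss on the response
])

vocab_size = 32_000
logits = torch.randn(len(input_ids), vocab_size)  # stand-in model output

# Standard one-position shift: predict token t+1 from position t.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
print(f"SFT loss over response tokens only: {loss.item():.4f}")
```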

Rather than using millions of labeled examples, the paper states that results were improved by using “fewer but higher-quality examples,” noting that Meta AI collected 27,540 annotated samples.

Following SFT, Meta used reinforcement learning from human feedback (RLHF) to further align the chat models’ behavior with human preferences and instructions. In RLHF, direct human feedback is used to train a “reward model” to learn patterns of the kind of responses humans prefer. By translating the reward model’s predictions (regarding whether a given response would be preferred by humans) into a scalar reward signal, the reward model is then used to further train Llama-2-chat via reinforcement learning.
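
Architecturally, a reward model is typically a language-model backbone whose output layer is replaced with a regression head producing one scalar score per sequence. The toy module below sketches that idea; the tiny transformer and its dimensions are illustrative stand-ins, not Meta’s implementation:

```python
# A minimal sketch of a reward model: a transformer backbone whose final
# hidden state is projected to a single scalar "preference" score.
# The tiny encoder here is a toy stand-in, not Meta's implementation.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size=32_000, hidden_size=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        layer = nn.TransformerEncoderLayer(hidden_size, nhead=4,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(hidden_size, 1)  # scalar reward

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(self.embed(input_ids))
        # Score the final token's representation: one scalar per sequence.
        return self.score_head(hidden[:, -1, :]).squeeze(-1)

reward_model = RewardModel()
responses = torch.randint(0, 32_000, (2, 16))  # two candidate responses
print(reward_model(responses))  # higher score = more preferred
```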

There are many different methods and formats in which that human feedback can be collected. Meta AI used a simple method of binary comparison: human annotators were asked to write a prompt, then choose between two model responses—based on criteria provided by Meta—generated by two different variants of Llama 2. To help the reward model properly weight these choices, annotators were also asked to rate the degree to which they preferred their chosen response over the other: “significantly better,” “slightly better” or “negligibly better/unsure.”
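
These graded binary comparisons map naturally onto a pairwise ranking loss: the reward model is trained so that the chosen response scores higher than the rejected one, and the Llama 2 paper reports adding a margin that grows with the annotators’ stated preference strength. Below is a minimal sketch of that ranking objective; the specific scores and margin values are illustrative assumptions:

```python
# A minimal sketch of the binary-ranking objective for reward-model
# training: push the chosen response's score above the rejected one's
# by a margin reflecting how strongly annotators preferred it.
# The scores and margin values below are illustrative only.
import torch
import torch.nn.functional as F

def ranking_loss(chosen_scores, rejected_scores, margin):
    # -log(sigmoid(r_chosen - r_rejected - margin)), batch-averaged
    return -F.logsigmoid(chosen_scores - rejected_scores - margin).mean()

chosen = torch.tensor([1.8, 0.4])    # scores for preferred responses
rejected = torch.tensor([0.9, 0.6])  # scores for the alternatives
# Larger margin for "significantly better", smaller for "negligibly better"
margin = torch.tensor([1.0, 0.1])

print(ranking_loss(chosen, rejected, margin))
```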

Human preferences were used to train two separate reward models: one optimized for helpfulness, the other optimized for safety (i.e. avoiding toxic, hateful responses or responses that might be used to aid in violence or criminal activity). In addition to proximal policy optimization (PPO), the algorithm typically used to update LLM model weights in RLHF, Meta also used rejection sampling (link resides outside ibm.com) to update Llama-2-chat-70B.
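
Rejection sampling itself is conceptually simple: sample several candidate responses per prompt, score each with the reward model, and fine-tune on the highest-scoring (“best-of-K”) candidates. In the minimal sketch below, generate_response and reward_score are hypothetical stand-ins for the policy and reward models:

```python
# A minimal sketch of rejection sampling for fine-tuning: draw K candidate
# responses per prompt, keep the one the reward model scores highest, and
# use those "best-of-K" samples as new fine-tuning data.
# `generate_response` and `reward_score` are hypothetical stand-ins.
import random

def generate_response(prompt: str) -> str:
    # Stand-in for sampling from the current Llama-2-chat policy.
    return f"candidate answer {random.randint(0, 999)} to: {prompt}"

def reward_score(prompt: str, response: str) -> float:
    # Stand-in for the trained reward model's scalar score.
    return random.random()

def best_of_k(prompt: str, k: int = 4) -> str:
    candidates = [generate_response(prompt) for _ in range(k)]
    return max(candidates, key=lambda r: reward_score(prompt, r))

# The selected samples would then be used for further gradient updates.
print(best_of_k("Teach me to bake cookies"))
```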

Code Llama

Code Llama, built on top of Llama 2, is fine-tuned for generating code (and natural language about code) from both code-based and natural language-based prompts. Introduced shortly after the release of the Llama 2 base and chat models, it’s free for research and commercial use.

Supporting most popular programming languages, including Python, C++, Java, PHP and JavaScript (among others), it’s available in model sizes of 7B, 13B and 34B parameters, and boasts a context length of up to 100,000 tokens. Two additional variations, Code Llama - Python and Code Llama - Instruct, are fine-tuned for Python (and PyTorch) and instruction following, respectively.
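
Below is a minimal sketch of prompting Code Llama for code completion through the Hugging Face transformers library, assuming the publicly hosted codellama/CodeLlama-7b-hf checkpoint and enough memory to load a 7B-parameter model:

```python
# A minimal sketch of code generation with Code Llama via Hugging Face
# `transformers`. Assumes the codellama/CodeLlama-7b-hf checkpoint and
# sufficient memory for a 7B-parameter model.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Prompt with the start of a function; the model completes the body.
prompt = 'def fibonacci(n):\n    """Return the nth Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```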

Llama 2 vs. closed-source models

Relative to closed-source competitors, Llama 2 models excel in areas like safety and factual accuracy. Though Llama 2 may not match the full capabilities of much larger models, its open availability and greater efficiency present unique advantages.

In comparing Llama 2 to the flagship proprietary models from competitors like OpenAI, Anthropic and Google, it’s important to consider scale. Though closed-source models do not always disclose the full details of their architecture, available information strongly suggests that they all greatly exceed the largest Llama 2 models’ 70 billion parameters:

  • GPT-3 has 175 billion parameters.
  • GPT-4 is estimated to have 1 trillion parameters.9
  • Google’s PaLM 2 is reported to have 340 billion parameters.10 Its predecessor, PaLM, has 540 billion parameters.11
  • Anthropic has not released parameter counts for Claude models, but a recent paper suggests the existence of a 175 billion-parameter version of Claude 2.12


Human evaluation
According to the Llama 2 research paper, human evaluators preferred Llama-2-chat 70B responses to those of GPT-3.5-turbo-0301, the standard model for ChatGPT: Llama 2 responses had a win rate of 36% and a tie rate of 31.5%. Relative to PaLM Bison, the second largest PaLM model, 70B had a win rate of over 50%.


Safety
In Meta’s testing, the 7B, 13B and 70B Llama 2 models all had significantly lower safety violation percentages than PaLM Bison—roughly 3% to 4%, compared to PaLM’s 27%—as well as lower safety violation percentages than ChatGPT’s 7%. This is a major strength for enterprise use cases, in which toxic, hateful or inflammatory language from chatbots can have major consequences.


Privacy and efficiency
An inherent advantage of smaller, open models over massive closed-source models is the freedom for businesses to run local model instances, and the cost-efficiency to do so without massive investments in infrastructure or cloud computing. Running a local model ensures that proprietary code, training modifications and proprietary data can be used to fine-tune model performance without being uploaded to a commercial server or potentially being used in future training of closed-source models. Furthermore, smaller model sizes, like the 7B and 13B variants, enable smoother performance in environments like mobile apps where processing power is limited.

How to use Llama 2

Llama 2 does not have its own dedicated API, but it’s accessible through multiple providers.

  • Llama-2-13B-chat and Llama-2-70B-chat are among the many foundation models available in watsonx, through IBM’s partnership with Hugging Face.

  • Model weights and starting code for Llama 2 can be downloaded directly from GitHub, where Meta also provides instructions, demos and “recipes” for Llama 2 (link resides outside ibm.com). The models can be used with open source machine learning frameworks like PyTorch and LLM application libraries like LangChain.

  • Llama 2 is also available through both open source providers like Hugging Face and enterprise providers like Microsoft Azure and Amazon SageMaker and Bedrock, in addition to a number of cloud-based startups.
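
As one concrete route among those listed above, here is a minimal sketch of running a Llama 2 chat model locally with the Hugging Face transformers library, assuming an approved license for the gated meta-llama/Llama-2-7b-chat-hf repository and hardware able to host a 7B-parameter model:

```python
# A minimal sketch of running Llama-2-7b-chat locally via Hugging Face
# `transformers`. Assumes approved access to the gated meta-llama repo
# and hardware able to host a 7B-parameter model.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
)

# Llama 2 chat models expect their instruction-formatted prompt template.
prompt = "[INST] Teach me to bake cookies [/INST]"
result = chat(prompt, max_new_tokens=200)
print(result[0]["generated_text"])
```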
Related solutions
IBM watsonx.ai

Train, validate, tune and deploy generative AI, foundation models and machine learning capabilities with ease and build AI applications in a fraction of the time with a fraction of the data.

Explore watsonx.ai

IBM watsonx Assistant

Deliver consistent and intelligent customer care across all channels and touchpoints with conversational AI.

Explore IBM watsonx Assistant

AI consulting services

Reimagine how you work with AI: our diverse, global team of more than 20,000 AI experts can help you quickly and confidently design and scale AI and automation across your business, working across our own IBM watsonx technology and an open ecosystem of partners to deliver any AI model, on any cloud, guided by ethics and trust.

Explore IBM AI consulting services
Llama 2 resources

Learn more about Llama 2 and the larger topic of foundation models.

Foundation models

Modern AI models that execute specific tasks in a single domain are giving way to ones that learn more generally, and work across domains and problems. Learn how foundation models, trained on large, unlabeled datasets and fine-tuned for an array of applications, are driving this shift.

How IBM is tailoring generative AI for enterprises

Learn how IBM is developing generative foundation models that are trustworthy, energy efficient, and portable, allowing enterprises to move AI workloads seamlessly between public and private clouds.

What is AI alignment?

Explore the process of encoding human values and goals into large language models to make them as helpful, safe, and reliable as possible. Through alignment, enterprises can tailor AI models to follow their business rules and policies.
