IBM granite-13b-chat-v2 model card

Granite Base 13 Billion Model Chat (granite-13b-chat-v2) Details

Model Version (2.1.0): Released 2/15/2024

The Granite 13 Billion Chat V2 (granite-13b-chat-v2) model is the chat-focused variant initialized from the pre-trained Granite 13 Billion Base V2 (granite-13b-base-v2) model. granite-13b-base-v2 was trained on over 2.5T tokens. IBM Generative AI Large Language Foundation Models are enterprise-level English-language models trained on a large volume of data that has been subjected to intensive pre-processing and careful analysis.

granite-13b-chat-v2 is a chat-focused model that was tuned to improve its ability to perform Retrieval Augmented Generation (RAG) use cases. In version 2.1.0, IBM applied a novel alignment technique that uses large-scale targeted alignment for a generalist LLM. This technique significantly improves base-model performance through knowledge infusion during the initial phase of alignment, and improves instruction-following performance via skill and style infusion in the subsequent phase of alignment.

The table below reports performance on the Massive Multitask Language Understanding (MMLU) and Multi-Turn Bench (MT-Bench) benchmarks.

Variant | Description / Intended Use | Pre-training Data Seen | MMLU (5-shot) | MT-Bench
granite-13b-chat-v2 (v2.1.0) | Chat-focused version of the base model supporting RAG, summarization, and generation tasks | 2.5T tokens | 57 | 6.92
  • Person or organization developing the model:

    • granite-13b-chat-v2 was developed by IBM Research.
  • Model release date and version:

    • Model version 2.1.0 was released on 2/15/2024.
    • granite-13b-chat-v2 was originally released on 11/30/2023.
  • Model type:

  • Information about training algorithms, parameters, fairness constraints or other applied approaches, and features:

    • Model was trained with Megatron-LM using 4x tensor parallelism, 4x pipeline parallelism, and the Megatron distributed optimizer (see the sketch after this list for how the GPU count decomposes under this layout).
    • Cluster: CCC
    • GPUs: 256x A100 80GB
    • Interconnect: 200 Gb/s InfiniBand
    • Dataset streamed over GPFS
  • Paper or other resource for more information:

    • Granite Paper: https://www.ibm.com/downloads/cas/X9W4O6BM
  • License:

    • Available only through IBM products and offerings. Contact IBM for licensing terms.
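
As a quick illustration of the training layout above, the sketch below derives how the 256 GPUs decompose under 4x tensor parallelism and 4x pipeline parallelism. The data-parallel degree is computed here for illustration; it is not stated in this card.

    # Minimal sketch: how 256 GPUs decompose under the 4x TP + 4x PP
    # layout listed above. The data-parallel degree is derived, not
    # stated in this model card.
    TENSOR_PARALLEL = 4      # 4x tensor parallel
    PIPELINE_PARALLEL = 4    # 4x pipeline parallel
    TOTAL_GPUS = 256         # 256x A100 80GB

    gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL   # 16 GPUs hold one model replica
    data_parallel_degree = TOTAL_GPUS // gpus_per_replica    # 16 replicas train in data parallel

    print(gpus_per_replica, data_parallel_degree)  # 16 16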

Version Release notes:

With version 2.1.0 of granite-13b-chat-v2, IBM implemented a number of new techniques specifically targeted at improving the model's ability to acquire knowledge and relevant skills. To accomplish this goal, the team applied IBM Research's novel alignment technique, which uses large-scale targeted alignment for a generalist LLM and includes several synthetic-data innovations. The behaviors targeted in the alignment step include:

  • Significantly improved quality of generated responses
  • Improved multi-turn conversation capabilities
  • Improved safety / bias reduction
  • High quality content-grounded responses

These changes were implemented in a way that should allow previously engineered prompts to transfer well to this version, although some prompts may need to be reworked.

Intended Use

  • Primary intended uses:

    • English-language closed-domain question answering (e.g., RAG), summarization, generation, extraction, and classification.
    • The granite-13b-chat-v2 model has demonstrated the capability to support the longer responses preferred in RAG-like use cases (see the prompt sketch after this list).
  • Primary intended users:

    • The primary users are IBM Enterprise clients looking to bolster their portfolios with Enterprise-level generative AI models.
  • Out-of-scope use cases:

    • The granite-13b models are not designed, tested, or supported for code use cases of any kind.
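
To make the closed-domain RAG pattern above concrete, the following is a minimal sketch of assembling a grounded prompt from retrieved passages. The prompt layout and instruction wording are illustrative assumptions, not IBM's documented prompt template for granite-13b-chat-v2.

    # Hypothetical RAG prompt assembly. The layout and instruction
    # wording are illustrative assumptions, not IBM's documented
    # template for granite-13b-chat-v2.
    def build_rag_prompt(question: str, passages: list[str]) -> str:
        context = "\n\n".join(
            f"[Document {i + 1}]\n{p}" for i, p in enumerate(passages)
        )
        return (
            "Answer the question using only the documents below. "
            "If the answer is not in the documents, say you do not know.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:"
        )

    prompt = build_rag_prompt(
        "What file system streamed the training data?",
        ["The training dataset was streamed over GPFS."],
    )
    # `prompt` is then sent to the model's text-generation endpoint.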

Factors

  • Relevant factors: The models work with proper English text. All datasets have been cleansed of any type of tagging (e.g., HTML), and all media has been removed (tag cleansing is illustrated by the sketch after this list).
  • Evaluation factors: Evaluation datasets must be proper English and are limited to text only.
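
As a minimal illustration of the kind of tag cleansing described above (IBM's actual pre-processing pipeline is proprietary; this sketch uses only the Python standard library):

    # Sketch: strip HTML tags from text using only the standard library.
    # This illustrates the idea of tag cleansing; it is not IBM's
    # pre-processing pipeline.
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.parts = []

        def handle_data(self, data):
            self.parts.append(data)

    def strip_tags(html: str) -> str:
        extractor = TextExtractor()
        extractor.feed(html)
        return "".join(extractor.parts)

    print(strip_tags("<p>Proper <b>English</b> text only.</p>"))
    # -> Proper English text only.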

Metrics

granite-13b-chat-v2 was evaluated using the following two well-known benchmarks:

Benchmark | Description
MMLU | Benchmark used to measure a text model's multitask accuracy. The test covers 57 tasks, including elementary mathematics, US history, computer science, law, and more (see the 5-shot prompt sketch below).
MT-Bench | Benchmark consisting of 80 high-quality multi-turn questions, designed to test multi-turn conversation and instruction-following ability, covering common use cases and focusing on challenging questions that differentiate models.
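
For context on the "5-shot" setting reported for MMLU, the sketch below assembles a prompt in the style of the standard MMLU harness: answered exemplars from a task's dev split, followed by the unanswered test question. The exemplar content here is invented for illustration.

    # Sketch of the standard MMLU few-shot prompt layout: answered
    # exemplars, then the test question left unanswered. The exemplar
    # content is invented for illustration.
    LETTERS = "ABCD"

    def format_example(question, choices, answer=None):
        lines = [question] + [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
        lines.append("Answer:" + (f" {answer}" if answer else ""))
        return "\n".join(lines)

    dev_shots = [  # a real 5-shot run uses five dev-split exemplars per subject
        ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ]

    prompt = (
        "The following are multiple choice questions (with answers) "
        "about elementary mathematics.\n\n"
        + "\n\n".join(format_example(q, c, a) for q, c, a in dev_shots)
        + "\n\n"
        + format_example("What is 7 * 8?", ["54", "56", "63", "64"])
    )
    # The model's continuation after the final "Answer:" is scored against "B".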

Performance Metrics

The evaluation of granite-13b-v2 can be found in the Granite Paper: https://www.ibm.com/downloads/cas/X9W4O6BM

Data, Limitations, and Recommendations

  • Data selection for training:
    • The granite-13b-base-v2 model underwent extended training on 2.5 trillion tokens of IBM's curated pre-training dataset. Building on the 2.5 trillion tokens seen by granite-13b-base-v2, the granite-13b-chat-v2 model then underwent knowledge- and skill-focused tuning. A breakdown of the sampling data used for training is shown in the table below.

Dataset sampling for Granite (13B) Base V2.0

Dataset | Description
Common Crawl | Open repository of web crawl data.
Webhose | Unstructured web content converted into machine-readable data feeds, acquired by IBM.
arXiv | Over 1.8 million scientific paper pre-prints posted to arXiv.
Wikimedia | Eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary), containing extracted plain text from pages and articles.
OpenWebText | Open-source version of OpenAI's WebText corpus containing web pages through 2019.
Stack Exchange | Anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers.
Hacker News | News on computer science and entrepreneurship, collected between 2007 and 2018.
Project Gutenberg PG19 | A repository of free e-books with a focus on older works for which U.S. copyright has expired.
GitHub Clean | Code data from CodeParrot covering a variety of programming languages.
PubMed Central | Biomedical and life sciences papers.
Free Law | Public-domain legal opinions from US federal and state courts.
SEC Filings | 10-K/Q filings from the US Securities and Exchange Commission (SEC) for the years 1934-2022.
Patents | US patents granted from 1975 to May 2023, excluding design patents.
DeepMind Mathematics | Mathematical question-and-answer pairs.
Earnings Call Transcripts | Transcripts of the quarterly earnings calls that companies hold with investors, together with the related stock prices and sector index.
EDGAR | Annual reports from all publicly traded companies in the US, spanning a period of more than 25 years.
FDIC | Data from the annual submissions of the Federal Deposit Insurance Corporation (FDIC).
Finance Textbooks | All textbooks tagged as finance in the Open Textbook Library, the University of Minnesota's free textbook library.
Financial Research Papers | Publicly available corpus of financial research papers.
IBM Documentation | IBM Redbooks and product documents.
  • Tokenizer used:
    • GPT-NeoX 20B (see the token-counting sketch after this list)
    • 2.5 Trillion Tokens
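
Because the tokenizer is GPT-NeoX 20B, prompt token counts can be estimated with the publicly released tokenizer. A minimal sketch, assuming the Hugging Face transformers library and the EleutherAI/gpt-neox-20b checkpoint (IBM's packaging of the tokenizer may differ):

    # Sketch: estimate token counts with the public GPT-NeoX-20B tokenizer.
    # Assumes the EleutherAI checkpoint on Hugging Face; IBM's packaging
    # of the tokenizer may differ.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
    token_ids = tokenizer.encode("Granite models are trained on 2.5T tokens.")
    print(len(token_ids))  # number of tokens the model sees for this text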

Dataset sampling for Granite (13B) Chat V2

The granite-13b-chat-v2 model, version 2.1.0, was initialized from granite-13b-base-v2 and was aligned using a novel training paradigm that applies large-scale targeted alignment to a generalist LLM. The alignment approach relies on IBM-generated synthetic data produced using a proprietary pipeline and approach.