Quantization
Overview
This guide covers the quantized IBM Granite Code models.
Quantization in machine learning refers to the process of reducing the precision of the weights and activations of a neural network, typically from 32 or 16-bit floating-point to lower bit-width representations like 8 or 4-bit integers. This process significantly reduces the memory footprint and computational requirements of the model, making it more efficient for deployment, especially in resource-constrained environments.
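As a rough illustration of the idea (not the exact scheme llama.cpp uses), the sketch below applies symmetric 8-bit quantization to a weight tensor with NumPy: each float is scaled into the signed 8-bit range and rounded, and dequantization multiplies back by the stored scale. The function names and the per-tensor scaling choice are ours for illustration only.

import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: map floats into [-127, 127].
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Recover approximate float weights from the int8 values.
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for one layer's weights
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
print("max absolute error:", np.abs(weights - approx).max())

The reconstruction error printed at the end is the price paid for the smaller representation; practical 4-bit schemes keep it low by quantizing weights in small blocks, each with its own scale.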
The quantization scheme
We use llama.cpp, a popular framework for running and deploying models, to quantize the full-precision models with a 4-bit integer mixed-precision scheme that balances quality, speed, and memory requirements.
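If you want to run one of these 4-bit models through llama.cpp from Python, a minimal sketch using the llama-cpp-python bindings is shown below. The GGUF file name is a placeholder, not the actual artifact name, and the sketch assumes you have already downloaded a quantized model file locally.

from llama_cpp import Llama

# Hypothetical path: point this at a locally downloaded 4-bit GGUF file.
llm = Llama(model_path="granite-code-3b.Q4_K_M.gguf", n_ctx=2048)

# Generate a completion with the quantized model.
output = llm(
    "Write a python function that reverses a string.",
    max_tokens=128,
)
print(output["choices"][0]["text"])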
IBM offers quantized Granite Code models in four sizes: 3b, 8b, 20b, and 34b.
Our testing
To evaluate the performance of our quantized models, we conducted tests on an Apple M3 Max with 64GB of memory. The test used the following prompt:
“Write a python class that represents User. A user will need a name, email address, bio, and a list of roles they have access to.”
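To reproduce similar measurements yourself, one option (our assumption here, not necessarily the harness used for the numbers below) is the ollama Python client, whose responses include token counts and timings. Depending on the library version the response behaves as a dict or an object; dict-style access is assumed in this sketch.

import ollama

prompt = ("Write a python class that represents User. A user will need a name, "
          "email address, bio, and a list of roles they have access to.")

# Assumes the model has already been pulled, e.g. with `ollama pull granite-code:3b`.
result = ollama.generate(model="granite-code:3b", prompt=prompt)

print(result["response"])
# Durations are reported in nanoseconds by the Ollama API.
tokens_per_s = result["eval_count"] / (result["eval_duration"] / 1e9)
print("load time (ms):", result["load_duration"] / 1e6)
print("generation rate (tokens/s):", tokens_per_s)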
Granite Code 3b
Response generation rate: 96.63 tokens/s
Model load time: 516.51 ms
Inference time: 1.33 s
Download via Ollama: ollama pull granite-code:3b
Model response:
To model the user with their attributes, we can create a Python class named User. This class will have four attributes: name, email, bio, and roles. We’ll also define a constructor that takes these attributes as parameters and assigns them to the respective instance variables. Here’s how you could implement this class:

class User:
    def __init__(self, name, email, bio, roles):
        self.name = name
        self.email = email
        self.bio = bio
        self.roles = roles
Granite Code 8b
Response generation rate: 56.71 tokens/s
Model load time: 548.45 ms
Inference time: 1.49 s
Download via Ollama: ollama pull granite-code:8b
Model response:
Here’s how you can implement this:
class User:
    def __init__(self, name, email, bio, roles):
        self.name = name
        self.email = email
        self.bio = bio
        self.roles = roles

    def has_access(self, role):
        return role in self.roles

Granite Code 20b
Response generation rate: 26.16 tokens/s
Model load time: 15.61 s
Inference time: 4.11 s
Download via Ollama: ollama pull granite-code:20b
Model response:
class User:
    def __init__(self, name, email, bio, roles):
        self.name = name
        self.email = email
        self.bio = bio
        self.roles = roles

    def get_name(self):
        return self.name

    def get_email(self):

Granite Code 34b
Response generation rate: 15.32 tokens/s
Model load time: 25.90 s
Inference time: 13.13 s
Download via Ollama: ollama pull granite-code:34b
Model response:
class User:
    def __init__(self, name, email, bio, roles):
        self.name = name
        self.email = email
        self.bio = bio
        self.roles = roles

    def get_name(self):
        return self.name
Run Granite locally
Check the following resources to run the quantized Granite Code models locally.