Quantization
Overview
Quantization in machine learning refers to the process of reducing the precision of a neural network's weights and activations, typically from 32- or 16-bit floating-point to lower bit-width representations such as 8- or 4-bit integers. This significantly reduces the model's memory footprint and computational requirements, making it more efficient to deploy, especially in resource-constrained environments.
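The core idea can be illustrated with a minimal sketch of affine 8-bit quantization, mapping floats to integers with a per-tensor scale and zero point (the values and helper names here are illustrative, not llama.cpp's implementation):

```python
def quantize_int8(values):
    """Map floats to uint8 range [0, 255] using a scale and zero point."""
    lo, hi = min(values), max(values)
    scale = max((hi - lo) / 255.0, 1e-12)   # step size between integer levels
    zero_point = round(-lo / scale)          # integer that represents 0.0
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [(v - zero_point) * scale for v in q]

weights = [0.42, -1.31, 0.07, 2.05, -0.66, 1.5, -2.0, 0.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)

# Each 8-bit value uses a quarter of the storage of a float32,
# at the cost of a small rounding error bounded by scale / 2.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing 8-bit integers instead of 32-bit floats cuts weight storage by 4x, and the reconstruction error is bounded by half the quantization step.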
The quantization scheme
We use llama.cpp, a popular framework for deploying models, to quantize the full-precision models to a 4-bit integer mixed-precision format that balances quality, speed, and memory requirements.
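llama.cpp's actual 4-bit formats are more elaborate than this (per-block scales packed into the GGUF file, with some tensors kept at higher precision in the mixed-precision variants), but the underlying idea, grouping weights into small blocks and storing one scale plus 4-bit integers per block, can be sketched as follows. This is a simplified illustration, not the real on-disk layout:

```python
import random

BLOCK = 32  # llama.cpp groups weights into blocks of 32

def quantize_q4_block(block):
    """Symmetric 4-bit quantization of one block: ints in [-8, 7] plus a scale."""
    amax = max(abs(v) for v in block)
    scale = amax / 7.0 if amax else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return q, scale

def dequantize_q4_block(q, scale):
    """Recover approximate floats: one multiply per weight."""
    return [v * scale for v in q]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(BLOCK)]
q, scale = quantize_q4_block(weights)
restored = dequantize_q4_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each block carries its own scale, a block of small weights is not forced to share a quantization step with a block containing an outlier, which is what keeps quality acceptable at 4 bits.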
IBM offers quantized Granite Code models in four sizes: 3B, 8B, 20B, and 34B parameters.
Our testing
To evaluate the performance of our quantized models, we conducted tests on an Apple M3 Max with 64GB of memory. The test used the following prompt:
“Write a python class that represents User. A user will need a name, email address, bio, and a list of roles they have access to.”
Granite Code 3B
Response generation rate:
96.63 tokens/s
Model load time:
516.51ms
Inference time:
1.33s
Download via Ollama:
ollama pull granite-code:3b
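Once pulled, the same test prompt can be sent programmatically through Ollama's REST API. The sketch below assumes an Ollama server running at its default address, http://localhost:11434:

```python
import json
from urllib import request

# Request body for Ollama's /api/generate endpoint.
payload = {
    "model": "granite-code:3b",
    "prompt": ("Write a python class that represents User. A user will need "
               "a name, email address, bio, and a list of roles they have "
               "access to."),
    "stream": False,  # return the full response as a single JSON object
}
body = json.dumps(payload).encode("utf-8")
req = request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request against a running Ollama server:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```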
Model response:

To model the user with their attributes, we can create a Python class named User. This class will have four attributes: name, email, bio, and roles. We’ll also define a constructor that takes these attributes as parameters and assigns them to the respective instance variables. Here’s how you could implement this class:

    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles

Granite Code 8B
Response generation rate:
56.71 tokens/s
Model load time:
548.45ms
Inference time:
1.49s
Download via Ollama:
ollama pull granite-code:8b
Model response:
Here’s how you can implement this:
    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles

        def has_access(self, role):
            return role in self.roles

Granite Code 20B
Response generation rate:
26.16 tokens/s
Model load time:
15.61s
Inference time:
4.11s
Download via Ollama:
ollama pull granite-code:20b
Model response:
    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles

        def get_name(self):
            return self.name

Granite Code 34B
Response generation rate:
15.32 tokens/s
Model load time:
25.90s
Inference time:
13.13s
Download via Ollama:
ollama pull granite-code:34b
Model response:
    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles

        def get_name(self):
            return self.name
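As a quick sanity check that the generated class behaves as intended, here is the reconstructed code from the responses above with a small usage example (the sample user data is illustrative):

```python
class User:
    def __init__(self, name, email, bio, roles):
        self.name = name
        self.email = email
        self.bio = bio
        self.roles = roles

    def get_name(self):
        return self.name

# Instantiate the class and exercise its attributes.
user = User(
    name="Ada Lovelace",
    email="ada@example.com",
    bio="First programmer.",
    roles=["admin", "editor"],
)
print(user.get_name())        # Ada Lovelace
print("admin" in user.roles)  # True
```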
Run Granite Locally
Check the following resources to run the quantized Granite Code models locally.