Quantization
Overview
Quantization in machine learning refers to the process of reducing the precision of a neural network's weights and activations, typically from 32- or 16-bit floating-point to lower bit-width representations such as 8- or 4-bit integers. This significantly reduces the model's memory footprint and computational requirements, making it more efficient to deploy, especially in resource-constrained environments.
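The core idea can be illustrated with a minimal sketch of affine 8-bit quantization, mapping floats to integers with a per-tensor scale and zero point (the values and helper names here are illustrative, not llama.cpp's implementation):

```python
def quantize_int8(values):
    """Map floats to uint8 range [0, 255] using a scale and zero point."""
    lo, hi = min(values), max(values)
    scale = max((hi - lo) / 255.0, 1e-12)   # step size between integer levels
    zero_point = round(-lo / scale)          # integer that represents 0.0
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate floats from the stored integers."""
    return [(v - zero_point) * scale for v in q]

weights = [0.42, -1.31, 0.07, 2.05, -0.66, 1.5, -2.0, 0.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)

# Each 8-bit value uses a quarter of the storage of a float32,
# at the cost of a small rounding error bounded by scale / 2.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Storing 8-bit integers instead of 32-bit floats cuts weight storage by 4x, and the reconstruction error is bounded by half the quantization step.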
The quantization scheme
We use llama.cpp, a popular framework for deploying models, to quantize the full-precision models to a 4-bit integer mixed-precision format that balances quality, speed, and memory requirements.
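llama.cpp's actual 4-bit formats are more elaborate than this (per-block scales packed into the GGUF file, with some tensors kept at higher precision in the mixed-precision variants), but the underlying idea, grouping weights into small blocks and storing one scale plus 4-bit integers per block, can be sketched as follows. This is a simplified illustration, not the real on-disk layout:

```python
import random

BLOCK = 32  # llama.cpp groups weights into blocks of 32

def quantize_q4_block(block):
    """Symmetric 4-bit quantization of one block: ints in [-8, 7] plus a scale."""
    amax = max(abs(v) for v in block)
    scale = amax / 7.0 if amax else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in block]
    return q, scale

def dequantize_q4_block(q, scale):
    """Recover approximate floats: one multiply per weight."""
    return [v * scale for v in q]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(BLOCK)]
q, scale = quantize_q4_block(weights)
restored = dequantize_q4_block(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Because each block carries its own scale, a block of small weights is not forced to share a quantization step with a block containing an outlier, which is what keeps quality acceptable at 4 bits.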
IBM offers quantized Granite Code models in four sizes: 3B, 8B, 20B, and 34B parameters.
Our testing
To evaluate the performance of our quantized models, we conducted tests on an Apple M3 Max with 64GB of memory. The test used the following prompt:
“Write a python class that represents User. A user will need a name, email address, bio, and a list of roles they have access to.”
Granite Code 3B
Response generation rate:
96.63 tokens/s
Model load time:
516.51ms
Inference time:
1.33s
Download via Ollama:
ollama pull granite-code:3b
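Once pulled, the same test prompt can be sent programmatically through Ollama's REST API. The sketch below assumes an Ollama server running at its default address, http://localhost:11434:

```python
import json
from urllib import request

# Request body for Ollama's /api/generate endpoint.
payload = {
    "model": "granite-code:3b",
    "prompt": ("Write a python class that represents User. A user will need "
               "a name, email address, bio, and a list of roles they have "
               "access to."),
    "stream": False,  # return the full response as a single JSON object
}
body = json.dumps(payload).encode("utf-8")
req = request.Request(
    "http://localhost:11434/api/generate",
    data=body,
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request against a running Ollama server:
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["response"])
```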
Model response:

To model the user with their attributes, we can create a Python class named User. This class will have four attributes: name, email, bio, and roles. We’ll also define a constructor that takes these attributes as parameters and assigns them to the respective instance variables. Here’s how you could implement this class:

    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles

Granite Code 8B
Response generation rate:
56.71 tokens/s
Model load time:
548.45ms
Inference time:
1.49s
Download via Ollama:
ollama pull granite-code:8b
Model response:
Here’s how you can implement this:
    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles

        def has_access(self, role):
            return role in self.roles

Granite Code 20B
Response generation rate:
26.16 tokens/s
Model load time:
15.61s
Inference time:
4.11s
Download via Ollama:
ollama pull granite-code:20b
Model response:
    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles

        def get_name(self):
            return self.name

Granite Code 34B
Response generation rate:
15.32 tokens/s
Model load time:
25.90s
Inference time:
13.13s
Download via Ollama:
ollama pull granite-code:34b
Model response:
    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles

        def get_name(self):
            return self.name
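As a quick sanity check that the generated class behaves as intended, here is the reconstructed code from the responses above with a small usage example (the sample user data is illustrative):

```python
class User:
    def __init__(self, name, email, bio, roles):
        self.name = name
        self.email = email
        self.bio = bio
        self.roles = roles

    def get_name(self):
        return self.name

# Instantiate the class and exercise its attributes.
user = User(
    name="Ada Lovelace",
    email="ada@example.com",
    bio="First programmer.",
    roles=["admin", "editor"],
)
print(user.get_name())        # Ada Lovelace
print("admin" in user.roles)  # True
```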
Run Granite Locally
Check the following resources to run the quantized Granite Code models locally.