Quantization

Overview

Quantization in machine learning refers to the process of reducing the precision of the weights and activations of a neural network, typically from 32- or 16-bit floating-point to lower bit-width representations such as 8- or 4-bit integers. This process significantly reduces the memory footprint and computational requirements of the model, making it more efficient for deployment, especially in resource-constrained environments.
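
To make the idea concrete, the sketch below symmetrically quantizes a float32 weight tensor to int8 and dequantizes it back. The function names and the single per-tensor scale are illustrative choices for this example, not the scheme used by any particular framework.

    import numpy as np

    def quantize_int8(weights):
        # Symmetric per-tensor quantization: map the largest magnitude to 127.
        scale = np.max(np.abs(weights)) / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        # Recover an approximate float32 tensor from the int8 values and the scale.
        return q.astype(np.float32) * scale

    weights = np.random.randn(4, 4).astype(np.float32)
    q, scale = quantize_int8(weights)
    print("max absolute error:", np.abs(weights - dequantize_int8(q, scale)).max())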

The quantization scheme

We use llama.cpp, a popular framework for deploying models, to quantize the full-precision models to a 4-bit integer mixed-precision format that balances quality, speed, and memory requirements.
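
Conceptually, llama.cpp's 4-bit formats group weights into small blocks and store a scale per block, which preserves more accuracy than the single per-tensor scale sketched above. The sketch below illustrates that blockwise idea with an assumed block size of 32 and simple symmetric scaling; it is a conceptual demonstration, not llama.cpp's exact on-disk format.

    import numpy as np

    BLOCK_SIZE = 32  # llama.cpp quantizes weights in small blocks; 32 is an illustrative choice here

    def quantize_4bit_blocks(weights):
        # Blockwise symmetric 4-bit quantization (conceptual sketch only).
        blocks = weights.reshape(-1, BLOCK_SIZE)
        scales = np.max(np.abs(blocks), axis=1, keepdims=True) / 7.0  # signed 4-bit values span roughly [-8, 7]
        q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)  # packed into 4 bits per value in practice
        return q, scales

    def dequantize_4bit_blocks(q, scales, shape):
        # Reconstruct approximate float weights from the 4-bit values and per-block scales.
        return (q.astype(np.float32) * scales).reshape(shape)

    w = np.random.randn(64, 64).astype(np.float32)
    q, scales = quantize_4bit_blocks(w)
    print("mean absolute error:", np.abs(w - dequantize_4bit_blocks(q, scales, w.shape)).mean())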

IBM offers quantized Granite Code models in four sizes: 3B, 8B, 20B, and 34B parameters.

Our testing

To evaluate the performance of our quantized models, we conducted tests on an Apple M3 Max with 64GB of memory. The test used the following prompt:

“Write a python class that represents User. A user will need a name, email address, bio, and a list of roles they have access to.”

  • Response generation rate: 96.63 tokens/s
  • Model load time: 516.51 ms
  • Inference time: 1.33 s
  • Download via Ollama: ollama pull granite-code:3b

Model response:

    To model the user with their attributes, we can create a Python class named User. This class will have four attributes: name, email, bio, and roles. We’ll also define a constructor that takes these attributes as parameters and assigns them to the respective instance variables. Here’s how you could implement this class:

    class User:
        def __init__(self, name, email, bio, roles):
            self.name = name
            self.email = email
            self.bio = bio
            self.roles = roles
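
For reference, instantiating the generated class works as you would expect; the sample values below are purely illustrative.

    # Using the User class generated above, with made-up sample values.
    user = User(
        name="Ada Lovelace",
        email="ada@example.com",
        bio="Mathematician and early computing pioneer.",
        roles=["admin", "editor"],
    )
    print(user.name, user.roles)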

Run Granite Locally

Check the following resources to run quantized Granite Code models locally.
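
For example, once the model has been pulled with Ollama, it can also be queried from Python through the ollama client library. The package name, the generate call, and the prompt below reflect that library's public API and are an assumption for illustration, not something covered on this page.

    # Assumes `pip install ollama` and that `ollama pull granite-code:3b` has already been run.
    import ollama

    response = ollama.generate(
        model="granite-code:3b",
        prompt="Write a python class that represents User. A user will need a name, "
               "email address, bio, and a list of roles they have access to.",
    )
    print(response["response"])  # the generated code is returned in the "response" field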