GGUF versus GGML

Published: 3 July 2024
Contributors: Tim Mucci

GPT-Generated Unified Format (GGUF) is a file format that streamlines the use and deployment of large language models (LLMs). GGUF is specially designed to store inference models and perform well on consumer-grade computer hardware.

It achieves this by combining the model parameters (weights and biases) with additional metadata for efficient execution. GGUF is clear, extensible, versatile and capable of incorporating new information without breaking compatibility with older models. GGUF is a more recent development that builds upon the foundations laid out by its predecessor file format, GGML.

GGUF is a binary format that is designed explicitly for the fast loading and saving of models. Its compatibility with programming languages such as Python and R has added to the format's popularity. It also supports fine-tuning, so users can adapt LLMs to specialized applications, and it stores prompt templates for model deployment across applications. While GGML is still in use, it has been superseded by GGUF.
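
For example, a GGUF model can be loaded and queried from Python in a few lines. The sketch below uses the llama-cpp-python bindings; the model file name and generation settings are illustrative assumptions rather than details taken from this article.

    # A minimal sketch using the llama-cpp-python bindings; the model path
    # and generation parameters are illustrative, not prescribed by GGUF.
    from llama_cpp import Llama

    # Load a GGUF model file from local disk (path is hypothetical).
    llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf", n_ctx=2048)

    # Run a single inference call on consumer-grade hardware.
    output = llm("Explain the GGUF file format in one sentence.", max_tokens=64)
    print(output["choices"][0]["text"])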


What is the difference between GGUF and GGML?

GGML was the file format that directly preceded GGUF, created by developer Georgi Gerganov. The name combines Gerganov's initials (GG) with ML for machine learning. GGML was a tensor library designed for high performance on various hardware platforms, and it was also an early attempt at creating a file format for GPT-style artificial intelligence models to facilitate the easy sharing and running of models. GGML was designed to be unambiguous and to contain all the information necessary to load a model.

GGML was an early effort to make large language models accessible on standard hardware, but it was limited in flexibility and extensibility. As a result, GGML required manual adjustments and ran into compatibility issues whenever users added new features to work around its limitations.

GGUF addresses GGML's limitations and allows new features to be added while maintaining compatibility with older models. Because GGUF eliminates breaking changes, it eases transitions to newer versions and supports a wide range of models, making it a comprehensive solution. Converting existing models to GGUF may be time-consuming, however, and as with any new format, users and developers must become accustomed to its specifics.

Converting to GGUF

Hugging Face is a company and community-driven platform that provides tools and models for natural language processing (NLP). It offers the Transformers library, which includes many pre-trained models that can be converted to the GGUF file format. Hugging Face also supports fine-tuning and deployment, making it integral to the ecosystem around GGUF.
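
As an illustration, the sketch below downloads a Hugging Face checkpoint and runs the converter script that ships with llama.cpp. The repository layout, script name and flags are assumptions that differ between llama.cpp versions, so verify them against your own checkout.

    # A hedged sketch of a typical Hugging Face-to-GGUF conversion workflow.
    # It assumes a local clone of the llama.cpp repository; the converter's
    # file name and flags (convert_hf_to_gguf.py, --outfile, --outtype) vary
    # between llama.cpp versions, so check your checkout before running.
    import subprocess
    from huggingface_hub import snapshot_download

    # Download the original checkpoint; substitute any model you can access.
    model_dir = snapshot_download(repo_id="meta-llama/Llama-2-7b-hf")

    # Invoke llama.cpp's converter to write one GGUF file in 16-bit precision.
    subprocess.run(
        [
            "python", "llama.cpp/convert_hf_to_gguf.py", model_dir,
            "--outfile", "llama-2-7b.f16.gguf",
            "--outtype", "f16",
        ],
        check=True,
    )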

Transformers are a type of model architecture that has become the backbone of modern NLP. GGUF supports storing and deploying transformer-based models for applications that rely on these advanced architectures.

Why GGUF is important

GGUF provides a robust, flexible and efficient format for language models. It addresses the limitations of previous formats, ensuring compatibility with evolving technologies and techniques. Its enhanced flexibility, improved performance and support for advanced quantization and deployment frameworks make it a critical tool for the future of AI and machine learning.

Model weights are the parameters that are learned by a machine learning model during training. GGUF stores these weights efficiently, allowing for quick loading and inference. Quantization methods applied to model weights can further enhance performance and reduce resource consumption.
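
To see what a GGUF file actually contains, the sketch below reads one with the gguf Python package that is published alongside llama.cpp. The file name is hypothetical, and attribute names can differ slightly between package versions.

    # A hedged sketch of inspecting a GGUF file with the `gguf` Python package;
    # attribute names (fields, tensors, tensor_type) may vary across versions.
    from gguf import GGUFReader

    reader = GGUFReader("llama-2-7b.Q4_K_M.gguf")  # hypothetical local file

    # Metadata key-value fields (architecture, tokenizer, prompt template, ...).
    for name in list(reader.fields)[:5]:
        print("field:", name)

    # Tensor records describing each stored weight and its quantization type.
    for tensor in reader.tensors[:5]:
        print("tensor:", tensor.name, tensor.shape, tensor.tensor_type)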

Quantization, the process of representing model weights in lower-precision formats with fewer possible values, plays a crucial role in GGUF. Quantization enhances efficiency and performance, particularly on hardware with limited resources. By reducing model size and improving inference speed, quantized models require less computational power, leading to reduced energy consumption. This makes GGUF highly suitable for deployment on edge devices and mobile platforms where power resources are constrained.
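
As a toy illustration (not GGUF's actual quantization schemes), the following snippet rounds a handful of weights to 4-bit integer levels plus a scale factor, which is the basic idea behind trading a little accuracy for a much smaller memory footprint.

    # A toy example of symmetric 4-bit quantization: weights are mapped to a
    # small set of integer levels plus one scale factor, shrinking storage
    # roughly 8x relative to 32-bit floats. Illustrative only, not the exact
    # scheme used by any GGUF quant type.
    import numpy as np

    weights = np.random.randn(8).astype(np.float32)

    # Scale so the largest magnitude maps to the extreme 4-bit value (7).
    scale = np.abs(weights).max() / 7
    quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)

    # Dequantize for inference and inspect the rounding error introduced.
    dequantized = quantized.astype(np.float32) * scale
    print("max abs error:", np.abs(weights - dequantized).max())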

For example, one specific quantization technique is GPTQ (Accurate Post-Training Quantization for Generative Pre-trained Transformers). GPTQ reduces the size and computational needs of an LLM by converting its high-precision weights into lower-precision representations. This allows LLMs to be deployed on devices with less memory and processing power.
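
A hedged sketch of post-training GPTQ quantization through the Hugging Face transformers integration (which relies on an auto-gptq backend) might look like the following; the model ID is a small illustrative placeholder and parameter names follow recent transformers releases.

    # A hedged sketch of GPTQ via transformers' quantization integration;
    # requires the auto-gptq backend and a GPU. The model ID is illustrative.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    model_id = "facebook/opt-125m"  # small model used only for illustration
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Calibrate 4-bit quantization against a standard text corpus.
    gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=gptq_config,
    )
    model.save_pretrained("opt-125m-gptq-4bit")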

GGUF is also designed to incorporate new features without breaking backward compatibility. This capability allows new data types and metadata to be added, making GGUF future-proof. As machine learning models evolve, GGUF can accommodate these changes, preserving its long-term relevance and adaptability.

GGUF's binary format design significantly improves the speed of loading and saving models, which is particularly vital for applications that require quick deployment and inference. Real-time language conversion services and interactive AI systems, for instance, benefit from GGUF's efficient model file handling. The quicker a model can be loaded and used, the better the user experience in these time-sensitive applications.

GGUF stands out due to its compatibility with advanced tuning and quantization techniques such as low-rank adaptation (LoRA), quantized low-rank adaptation (QLoRA) and activation-aware weight quantization (AWQ). These techniques further optimize model performance and resource utilization.

Moreover, GGUF supports various quant levels, providing flexibility in balancing model accuracy and efficiency. Common quantization schemes that are supported by GGUF include:

  • 2-bit quantization: Offers the highest compression, significantly reducing model size and speeding up inference, though with a potential impact on accuracy.
  • 4-bit quantization: Balances compression and accuracy, making it suitable for many practical applications.
  • 8-bit quantization: Provides good accuracy with moderate compression, widely used in various applications.

Quants refer to the various quantization levels applied to model weights, such as 2-bit, 4-bit or 8-bit quantization.
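
In practice, these levels are typically produced with llama.cpp's quantization tool, as in the hedged sketch below; the binary name (llama-quantize in recent builds, formerly quantize) and the exact type labels depend on your llama.cpp version.

    # A hedged sketch of producing several quant levels from an f16 GGUF file
    # with llama.cpp's quantization tool; the binary name and type labels
    # depend on the llama.cpp build you have installed.
    import subprocess

    for quant_type in ("Q2_K", "Q4_K_M", "Q8_0"):  # roughly 2-, 4- and 8-bit
        subprocess.run(
            [
                "llama.cpp/llama-quantize",
                "llama-2-7b.f16.gguf",
                f"llama-2-7b.{quant_type}.gguf",
                quant_type,
            ],
            check=True,
        )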

GGUF models can also use Compute Unified Device Architecture (CUDA), a parallel computing platform and application programming interface that allows models to use GPUs for accelerated computing tasks. This capability enhances language models' computational efficiency and speed. Finally, GGUF's integration with LangChain, a framework for developing and deploying language models, facilitates the deployment of GGUF models so they can be used effectively in development environments and applications.
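
For instance, a GGUF model can be served through LangChain with GPU offload roughly as follows; the import path and parameter names reflect recent langchain-community releases, and the model path is a placeholder.

    # A minimal sketch of running a GGUF model through LangChain's LlamaCpp
    # wrapper with CUDA offload; import path and parameters may differ in
    # other langchain-community versions, and the model path is hypothetical.
    from langchain_community.llms import LlamaCpp

    llm = LlamaCpp(
        model_path="./models/llama-2-7b.Q4_K_M.gguf",
        n_gpu_layers=-1,   # offload all layers to the GPU (CUDA build required)
        n_ctx=2048,
        temperature=0.2,
    )

    print(llm.invoke("Summarize why GGUF suits local deployment."))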

GGUF models and use cases
Large language model Meta AI (LLaMA)

Meta's Llama models (Llama 2 and Llama 3) are designed for NLP tasks including text generation, summarization and question answering, and they are widely distributed and run in GGUF form. GGUF allows Llama models to be deployed across different hardware configurations, from high-performance GPUs to the more common consumer-grade CPUs. Llama 3 is the current model.

Text generation webUI

This web interface generates text using LLMs and uses GGUF for model storage and inference. GGUF's flexibility allows users to load large models quickly to perform text generation tasks with minimal latency.

KoboldCpp

A popular client for running LLMs locally, KoboldCpp has adopted GGUF to improve its performance for end-users. This is especially beneficial for hobbyists and researchers who require robust and user-friendly solutions for experimenting with LLMs on personal computers.

Community and ecosystem support

GGUF's development is backed by a collaborative community. Numerous libraries and tools have been developed to support GGUF, ensuring broad adoption and integration into various AI workflows. Some of the key players in this ecosystem include:

  • llama.cpp: A core library that provides tools for working with GGUF, including conversion utilities and support for running models.
  • ctransformers: This library supports integrating GGUF models into different programming environments, making it easier for developers to use these models in their applications (see the sketch after this list).
  • LoLLMS Web UI: A web-based interface that supports GGUF, allowing users to interact with models through a user-friendly interface.
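
As an example of that integration, a GGUF checkpoint can be loaded through ctransformers roughly as follows; the repository and file names are illustrative placeholders.

    # A hedged sketch of loading a GGUF checkpoint with ctransformers; the
    # Hugging Face repository and file names below are illustrative only.
    from ctransformers import AutoModelForCausalLM

    llm = AutoModelForCausalLM.from_pretrained(
        "TheBloke/Llama-2-7B-GGUF",           # repo hosting pre-made GGUF files
        model_file="llama-2-7b.Q4_K_M.gguf",  # which quantized file to load
        model_type="llama",
    )

    print(llm("GGUF is"))
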
Future-proofing and broader impact

GGUF's introduction marks a shift toward more sustainable and adaptable generative model formats. Its ability to support a wide range of models and configurations means that it is not limited to specific use cases or hardware. This versatility ensures that GGUF can continue to meet the needs of the AI community as new advancements emerge.

Moreover, GGUF's emphasis on backward compatibility minimizes disruption during upgrades, making it easier for organizations to transition to newer versions without significant downtime or reconfiguration.

GGUF, being an open format, benefits from the collaborative contributions of the open source community, which helps in its development, improvement and widespread adoption. Its adoption in high-profile projects like LLaMA and various AI tools underscores its significance in the ongoing evolution of large language models. By facilitating faster, more flexible and future-proof model deployment, GGUF plays a crucial role in advancing the capabilities of AI systems.
