Large and complex machine learning models, such as the large language models (LLMs) behind tools like ChatGPT, require enormous time and compute to train. They can have billions or even trillions of parameters whose values are fixed during training. Once this process is complete, the model may be powerful and accurate in general, but it is not necessarily suited to carrying out a specific task.
Getting a model to work well in a specific context can require a great deal of retraining that changes all of its parameters. With models of this size, that retraining is expensive and time-consuming. LoRA (low-rank adaptation) provides a quicker way: it freezes the pretrained weights and trains only small low-rank update matrices, adapting the model without retraining all of its parameters.
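The idea can be sketched in a few lines. In this minimal, illustrative example (the dimensions and variable names are assumptions, not from any real model), a frozen weight matrix W is augmented with two small trainable factors A and B, so the effective weight becomes W + BA:

```python
import numpy as np

# Minimal sketch of one LoRA-adapted linear layer (illustrative only).
# The pretrained weight W is frozen; only the low-rank factors A and B train.
d_out, d_in, r = 1024, 1024, 8            # r is much smaller than d_out, d_in

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable, initialized small
B = np.zeros((d_out, r))                  # trainable, zero-init so the update starts at 0

def lora_forward(x):
    # y = W x + (B A) x  -- the update B @ A has rank at most r
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
# Before any training, B is all zeros, so the adapted layer matches the frozen one.
assert np.allclose(lora_forward(x), W @ x)

full_params = W.size          # parameters touched by full fine-tuning of this layer
lora_params = A.size + B.size # trainable parameters under LoRA
print(full_params, lora_params)  # 1048576 vs 16384: a 64x reduction for this layer
```

Because only A and B receive gradients, the optimizer state and gradient memory shrink in proportion to the trainable-parameter count, which is where the savings come from.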
As an example, a full fine-tuning of the GPT-3 model means training all 175 billion of its parameters. Using LoRA, the trainable parameters for GPT-3 can be reduced to roughly 18 million, which cuts GPU memory requirements by roughly two thirds.
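The arithmetic behind those figures (taken from the text above, not measured here) works out to a trainable-parameter reduction of roughly ten thousand times:

```python
# Rough arithmetic for the GPT-3 example above; figures come from the text.
full_params = 175e9   # all parameters updated in full fine-tuning
lora_params = 18e6    # trainable parameters under LoRA
ratio = full_params / lora_params
print(f"trainable-parameter reduction: ~{ratio:,.0f}x")  # roughly 10,000x
```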
LoRA is not the only efficient fine-tuning method. One notable variant is quantized LoRA (QLoRA), a fine-tuning technique that combines high-precision computation with low-precision storage: the frozen base model's weights are kept in a compressed low-bit format and converted back to higher precision only when needed for computation. This keeps the model's memory footprint small while preserving its performance and accuracy.
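The storage/compute split can be illustrated with a toy quantization scheme. This sketch uses simple 8-bit absmax quantization for clarity; QLoRA itself uses a 4-bit NormalFloat format, and all names here are assumptions for illustration:

```python
import numpy as np

# Toy illustration of low-precision storage with high-precision compute.
# (QLoRA uses 4-bit NormalFloat; this sketch uses int8 absmax for simplicity.)
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4)).astype(np.float32)   # pretrained weight

scale = np.abs(W).max() / 127.0                  # per-tensor absmax scale
W_q = np.round(W / scale).astype(np.int8)        # low-precision storage: 1 byte/weight

def forward(x):
    W_deq = W_q.astype(np.float32) * scale       # dequantize to high precision to compute
    return W_deq @ x

x = rng.normal(size=4).astype(np.float32)
err = np.abs(forward(x) - W @ x).max()
print(f"max abs error from quantization: {err:.4f}")  # small but nonzero
assert W_q.nbytes == W.nbytes // 4               # 4x smaller storage than float32
```

In QLoRA, only the frozen base weights are stored this way; the small LoRA factors remain in higher precision and are the only parameters that train.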