The model passes the tokens through a transformer network. Transformer models, introduced in 2017, are useful due to their self-attention mechanism, which allows them to “pay attention to” different tokens at different moments. This technique is the centerpiece of the transformer and its prime innovation. Self-attention is useful in part because it allows the AI model to calculate the relationships and dependencies between tokens, especially ones that are distant from one another in the text. Transformer architectures also allow for parallelization, making the process much more efficient than previous methods. These qualities allowed LLMs to handle unprecedentedly large datasets.
Once text is split into tokens, each token is mapped to a vector of numbers called an embedding. Neural networks consist of layers of artificial neurons, where each neuron performs a mathematical operation. Transformers consist of many of these layers, and at each, the embeddings are slightly adjusted, becoming richer contextual representations from layer to layer.
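To make the lookup concrete, here is a minimal sketch in Python with NumPy. The tiny vocabulary, the four-dimensional embeddings and the random values are placeholders; a real model learns its embedding table during training and uses a vocabulary of tens of thousands of tokens.

```python
import numpy as np

# Toy vocabulary and embedding table; real models learn these values during training.
vocab = {"the": 0, "dog": 1, "barked": 2}
embed_dim = 4
embedding_table = np.random.randn(len(vocab), embed_dim)  # one row of numbers per token

tokens = ["the", "dog", "barked"]
token_ids = [vocab[t] for t in tokens]
embeddings = embedding_table[token_ids]  # look up each token's vector
print(embeddings.shape)                  # (3 tokens, 4 dimensions)
```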
The goal in this process is for the model to learn semantic associations between words, so that words like “bark” and “dog” appear closer together in vector space in an essay about dogs than “bark” and “tree” would, based on the surrounding dog-related words in the essay. Transformers also add positional encodings, which give each token information about its place in the sequence.
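One widely used scheme for those positional encodings is the sinusoidal encoding from the original 2017 transformer paper; the sketch below assumes an even embedding dimension, and note that many newer LLMs use learned or rotary encodings instead.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, embed_dim):
    """Sinusoidal positional encodings as described in the original transformer paper."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, embed_dim, 2)[None, :]       # even dimension indices 0, 2, 4, ...
    angles = positions / (10000 ** (dims / embed_dim))
    encoding = np.zeros((seq_len, embed_dim))
    encoding[:, 0::2] = np.sin(angles)               # sine on even dimensions
    encoding[:, 1::2] = np.cos(angles)               # cosine on odd dimensions
    return encoding

# The encodings are simply added to the token embeddings from the previous sketch:
# embeddings = embeddings + sinusoidal_positional_encoding(len(tokens), embed_dim)
```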
To compute attention, each embedding is projected into three distinct vectors using learned weight matrices: a query, a key, and a value. The query represents what a given token is “seeking,” the key represents the information that each token contains, and the value “returns” the information from each key vector, scaled by its respective attention weight.
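As a rough illustration, these projections are just matrix multiplications. The dimensions and random weight matrices below are placeholders standing in for values a real model would learn.

```python
import numpy as np

embed_dim, head_dim = 4, 4
# Learned weight matrices (random placeholders here) that project each embedding
# into its query, key and value vectors.
W_q = np.random.randn(embed_dim, head_dim)
W_k = np.random.randn(embed_dim, head_dim)
W_v = np.random.randn(embed_dim, head_dim)

X = np.random.randn(3, embed_dim)          # stand-in for the 3 token embeddings above
Q, K, V = X @ W_q, X @ W_k, X @ W_v        # one query, key and value vector per token
```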
Alignment scores are then computed as the similarity between queries and keys. These scores, once normalized into attention weights, determine how much of each value vector flows into the representation of the current token. This process allows the model to flexibly focus on relevant context while ignoring less important tokens (like “tree”).
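Here is a minimal sketch of this standard scaled dot-product formulation, continuing the placeholder Q, K and V matrices from the previous snippet.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention over a single sequence."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # alignment scores: query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights, each row sums to 1
    return weights @ V                               # each output is a weighted mix of value vectors

# Continuing the placeholders from the previous sketch:
# contextualized = scaled_dot_product_attention(Q, K, V)   # shape: (3 tokens, head_dim)
```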
Self-attention thus creates “weighted” connections between all tokens more efficiently than earlier architectures could. The model assigns a weight to each relationship between tokens. LLMs can have billions or trillions of these weights, which are one type of parameter: the internal configuration variables of a machine learning model that control how it processes data and makes predictions. A model’s parameter count is simply how many of these variables it contains. So-called small language models are smaller in scale and scope, with comparatively few parameters, making them suitable for deployment on smaller devices or in resource-constrained environments.
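Some back-of-the-envelope arithmetic shows how quickly these weights add up. The layer, head and dimension sizes below are hypothetical, loosely in the range of a mid-sized open model, and they count only the query, key and value projection matrices.

```python
embed_dim, head_dim, num_heads, num_layers = 4096, 128, 32, 32  # hypothetical model sizes

params_per_head = 3 * embed_dim * head_dim       # one W_q, W_k and W_v per attention head
attention_params = num_layers * num_heads * params_per_head
print(f"{attention_params:,}")                   # 1,610,612,736 -> ~1.6 billion weights
                                                 # from the attention projections alone
```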
During training, the model makes predictions across millions of examples drawn from its training data, and a loss function quantifies the error of each prediction. Through an iterative cycle of making predictions and then updating model weights through backpropagation and gradient descent, the model “learns” the weights in the layers that produce the query, key and value vectors.
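The sketch below shows that cycle on a deliberately toy model, a single embedding layer feeding a linear output layer, trained on random stand-in data with PyTorch. It is meant only to illustrate the predict, score, backpropagate and update loop, not any particular LLM’s training setup.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy task: predict the next token id from the previous one.
vocab_size, embed_dim = 100, 16
embedding = torch.nn.Embedding(vocab_size, embed_dim)
to_logits = torch.nn.Linear(embed_dim, vocab_size)
optimizer = torch.optim.SGD(
    list(embedding.parameters()) + list(to_logits.parameters()), lr=0.1
)

inputs = torch.randint(0, vocab_size, (32,))    # stand-in training examples
targets = torch.randint(0, vocab_size, (32,))   # the "correct" next tokens

for step in range(100):
    logits = to_logits(embedding(inputs))       # the model's predictions
    loss = F.cross_entropy(logits, targets)     # loss function quantifies the error
    optimizer.zero_grad()
    loss.backward()                             # backpropagation computes gradients
    optimizer.step()                            # gradient descent nudges the weights
```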
Once those weights are sufficiently optimized, they can take in any token’s original vector embedding and produce query, key and value vectors for it. When those vectors interact with the ones generated for all the other tokens, they yield “better” alignment scores, and in turn attention weights, that help the model produce better outputs. The end result is a model that has learned patterns in grammar, facts, reasoning structures, writing styles and more.