
Published: 5 April 2024
Contributors: Dave Bergmann

What is mixture of experts?

Mixture of experts (MoE) is a machine learning approach that divides an artificial intelligence (AI) model into separate sub-networks (or “experts”), each specializing in a subset of the input data, to jointly perform a task.

Mixture of experts architectures enable large-scale models, even those comprising many billions of parameters, to greatly reduce computation costs during pre-training and to run faster at inference time. Broadly speaking, MoE achieves this efficiency by selectively activating only the specific experts needed for a given task, rather than activating the entire neural network for every task.

Though much of the modern implementation of mixture of experts setups was developed over (roughly) the past decade, the core premise behind MoE models originates from the 1991 paper “Adaptive Mixtures of Local Experts.” The paper proposed training an AI system composed of separate networks that each specialized in a different subset of training cases. This entailed training both the “expert networks” themselves and a gating network that determines which expert should be used for each subtask. When compared to an analogous conventional model, the authors found that their experimental model was significantly faster to train: it reached the target accuracy threshold in half as many training epochs as the conventional model.1

In recent years, as the leading deep learning models used for generative AI have grown increasingly large and computationally demanding, mixture of experts offers a means to address the tradeoff between the greater capacity of larger models and the greater efficiency of smaller models. This has been most notably explored in the field of natural language processing (NLP): some leading large language models (LLMs), such as Mistral’s Mixtral 8x7B and (according to some reports) OpenAI’s GPT-4,2 have employed MoE architectures.

Mixtures of experts in deep learning

Modern deep learning models are built from artificial neural networks comprising multiple layers of interconnected nodes (or “neurons”). Each neuron has an activation function: a mathematical operation performed on data received from the previous layer, whose output informs the input fed to the following layer. Classic feed-forward neural networks (FFNs) process information by progressively passing input data from neurons in one layer to neurons in the following layer, until it reaches an output layer where the final predictions are made. Some neural network architectures incorporate additional elements, like the self-attention mechanisms of transformer models, that capture additional patterns and dependencies in the input data.

The connections between different layers and neurons are mediated by learnable model parameters: variable weights (and biases) that amplify or diminish the influence a given part of the network’s output has on other parts of the network. A deep learning model “learns” by adjusting these parameters, using optimization algorithms like gradient descent, in a way that increases the accuracy of its predictions.

While a larger number of parameters increases the model’s capacity—its ability to absorb information and the patterns within it—it also increases the computational resources needed to train and operate the model. In a typical deep learning model—what in this context is referred to as a dense model—the entire network is executed to process any and all inputs. This creates a tradeoff between model capacity and practicality.
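
To make the “dense” baseline concrete, the following is a minimal sketch (in PyTorch, with illustrative layer sizes not taken from any particular model) of the kind of feed-forward block in which every parameter participates in processing every input:

import torch.nn as nn

class DenseFFN(nn.Module):
    """A standard dense feed-forward block: all weights are used for every token."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),  # expand
            nn.GELU(),                     # activation function
            nn.Linear(d_hidden, d_model),  # project back to the model dimension
        )

    def forward(self, x):
        # Every parameter in self.net participates in every forward pass.
        return self.net(x)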

Unlike conventional dense models, mixture of experts uses conditional computation to enforce sparsity: rather than using the entire network for every input, MoE models learn a computationally cheap mapping function that determines which portions of the network—in other words, which experts—are most effective to process a given input, like an individual token used to represent a word or word fragment in NLP tasks.

This allows the capacity of the model to be increased (by expanding the total number of parameters) without a corresponding increase in the computational burden required to train and run it (because not all of those parameters will necessarily be used at any given time).


How do mixture of experts models work?

MoE models process data by designating a number of “experts,” each its own sub-network within a larger neural network, and training a gating network (or router) to activate only the specific expert(s) best suited to a given input.

The primary benefit of the MoE approach is that, by enforcing sparsity rather than activating the entire neural network for each input token, model capacity can be increased while keeping computational costs essentially constant.

On an architectural level, this is achieved by replacing traditional, dense feed-forward network (FFN) layers with sparse MoE layers (or blocks). In the parlance of neural networks, “block” refers to a recurring structural element that performs a specific function. In a sparse MoE model (SMoE), these expert blocks can be single layers, self-contained FFNs or even nested MoEs unto themselves.

For example, in Mistral’s Mixtral 8x7B language model, each layer is composed of 8 feedforward blocks—that is, experts—each of which has 7 billion parameters. For every token, at each layer, a router network selects two of those eight experts to process the data. It then combines the outputs of those two experts and passes the result to the following layer. The specific experts selected by the router at a given layer may be different experts from those selected at the previous or next layer.3
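
As an illustration of this structure, here is a minimal sketch of a sparse MoE block with top-2 routing, written in PyTorch. The dimensions, the number of experts and the per-token loop are purely illustrative and are not Mixtral’s actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # the gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (n_tokens, d_model)
        scores, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(scores, dim=-1)     # combine weights over the chosen experts
        out = torch.zeros_like(x)
        for t in range(x.size(0)):              # route each token to its top-k experts
            for slot in range(self.k):
                expert = self.experts[int(idx[t, slot])]
                out[t] += weights[t, slot] * expert(x[t])
        return out                              # passed on to the following layer

A full transformer-style model would interleave blocks like this with self-attention layers, and (as noted above) the router in each MoE block makes its own, independent choice of experts for every token.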

MoE parameter counts

An SMoE may be composed entirely of sparse MoE layers, but many MoE model architectures entail both sparse and dense blocks. This is also true of Mixtral, in which the blocks responsible for the model’s self-attention mechanism are shared across all 8 experts. In practice, this makes designations like “8x7B” potentially misleading: since many of the model’s parameters are shared by each 7-billion-parameter expert sub-network, Mixtral has a total of about 47 billion parameters—not 56 billion, as one might assume through simple multiplication.

This overall parameter count is commonly referenced as the sparse parameter count and can generally be understood as a measure of model capacity. The number of parameters that will actually be used to process an individual token (as it transits through some expert blocks and bypasses others) is called the active parameter count, and can be understood as a measure of the model’s computational costs. Though each token input to Mixtral has access to 46.7 billion parameters, only 12.9 billion active parameters are used to process a given example.
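
A rough back-of-the-envelope calculation, using assumed round numbers rather than Mixtral’s actual configuration, shows how the two counts diverge:

# Hypothetical MoE configuration (illustrative numbers only)
shared_params     = 1.5e9   # assumed: attention, embeddings, etc., used by every token
params_per_expert = 5.5e9   # assumed: parameters in one expert sub-network
n_experts, k      = 8, 2    # experts per layer; experts activated per token

sparse_params = shared_params + n_experts * params_per_expert  # ~45.5B: model capacity
active_params = shared_params + k * params_per_expert          # ~12.5B: per-token compute

print(f"Sparse (total) parameter count: {sparse_params / 1e9:.1f}B")
print(f"Active parameter count:         {active_params / 1e9:.1f}B")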

Understanding this optimal utilization of parameter counts is key to understanding the upside of MoE models. For example, Mixtral outperforms the 70-billion-parameter variant of Meta’s Llama 2 across most benchmarks—with much greater speed—despite having a third fewer total parameters and using less than 20% as many active parameters at inference time.3

It’s worth noting, however, that a sparse MoE’s overall parameter count is not totally irrelevant to computational requirements. Despite only using a subset of parameters during inference, the entirety of the model’s parameters must be loaded into memory, meaning that the computational efficiency enjoyed by SMoEs in most regards does not apply to their RAM/VRAM requirements.

Sparsity

Key to the concept (and efficiency) of MoEs is that only some of the experts (and therefore parameters) in a sparse layer will be activated at any given time, thereby reducing active computational requirements.

Though conditional computation had long been proposed as a theoretical means to decouple computational demands from increased model capacity, the algorithmic and performance challenges to its successful execution were not overcome until Shazeer et al.’s 2017 paper “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.”4

The advantages of sparse layers over dense layers are most evident when dealing with high-dimensional data wherein patterns and dependencies are often complex and non-linear: for example, in NLP tasks that call for a model to process a lengthy sequence of text, each word is typically only related to a small subset of other words in that sequence. This makes SMoEs an area of tremendous potential in the field of LLMs, where well-calibrated MoE models can enjoy the benefits of sparsity without sacrificing performance. Sparsely-gated MoE models have also been successfully applied to computer vision tasks,5 6 and remain an area of active study in that field. 

This sparsity is achieved through conditional computation: the dynamic activation of specific parameters in response to specific inputs. The effective design of the gating network (or “router”), which enforces that conditional computation, is thus essential to the success of MoE models.

Routing

A number of gating mechanisms can be used to select which experts are utilized in a given situation. The right gating function is critical to model performance, as a poor routing strategy can result in some experts being under-trained or overly specialized and reduce the efficacy of the entire network.

A typical gating mechanism in a traditional MoE setup, introduced in Shazeer’s seminal paper, uses the softmax function: for each expert, on a per-example basis, the router predicts a probability value (based on the gating network’s learned weights and the current input) that the expert will yield the best output for that input. Rather than computing the output of all the experts, the router computes only the output of (what it predicts to be) the top k experts for that example. As described earlier, Mixtral uses this classic top-k routing strategy: specifically, it uses top-2 routing—that is, k=2—selecting the best 2 (out of its total of 8) experts.
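
A minimal sketch of this kind of softmax-based top-k gate is shown below (in PyTorch); it assumes the router’s logits for each token have already been computed by a small linear layer, and k=2 corresponds to the top-2 routing described above:

import torch
import torch.nn.functional as F

def top_k_gate(router_logits, k=2):
    """router_logits: (n_tokens, n_experts) scores from the gating network."""
    probs = F.softmax(router_logits, dim=-1)           # probability per expert, per token
    gate_weights, expert_idx = probs.topk(k, dim=-1)   # keep only the k most likely experts
    gate_weights = gate_weights / gate_weights.sum(dim=-1, keepdim=True)  # renormalize
    return gate_weights, expert_idx                    # combine weights and expert ids

# Example: route 4 tokens across 8 experts with top-2 gating.
weights, experts = top_k_gate(torch.randn(4, 8), k=2)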

In their influential 2021 paper, “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,” Fedus et al. took top-k routing to its extreme: working with Google’s T5 LLM, they replaced the model’s FFN layers with 128 experts and implemented k=1, also called “hard routing.” Even when scaling the model up to a trillion parameters, this setup improved pre-training speeds by 400%.6

Load balancing

Despite their many benefits, MoEs add significant complexity to the training process. An important downside to the “vanilla” top-k routing strategy is the potential for the gating network to converge to activating just a few experts. This is a self-reinforcing problem: if a handful of experts are disproportionately selected early on, those experts will be trained more quickly, and then continue to be selected more as they now output more reliable predictions than the other, less-trained experts. This imbalanced load means other experts ultimately end up, figuratively and literally, as dead weight(s).

To mitigate this, Shazeer et al. introduced noisy top-k gating: some Gaussian noise is added to the probability values predicted for each expert, introducing randomness that encourages a more evenly distributed activation of experts. They also added two trainable regularization terms to expert selection: minimizing load balancing loss penalizes overreliance on any one expert, while minimizing expert diversity loss rewards equal utilization of all experts.
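
The sketch below (PyTorch) illustrates the spirit of this approach: Gaussian noise is added to the router scores before the top-k selection, and an auxiliary balancing term penalizes uneven use of experts. The exact loss formulations in the original paper differ in detail, so this is an illustrative approximation rather than a faithful reimplementation:

import torch
import torch.nn.functional as F

def noisy_top_k_gate(router_logits, noise_scale, k=2):
    """router_logits, noise_scale: (n_tokens, n_experts) scores and learned noise scales."""
    # Perturb the scores with Gaussian noise so expert selection is not fully deterministic.
    noisy_logits = router_logits + torch.randn_like(router_logits) * F.softplus(noise_scale)
    probs = F.softmax(noisy_logits, dim=-1)
    gate_weights, expert_idx = probs.topk(k, dim=-1)
    gate_weights = gate_weights / gate_weights.sum(dim=-1, keepdim=True)

    # Auxiliary balancing term: squared coefficient of variation of the total probability
    # mass assigned to each expert across the batch. Minimizing it pushes toward equal use.
    importance = probs.sum(dim=0)
    balance_loss = importance.var(unbiased=False) / (importance.mean() ** 2 + 1e-9)
    return gate_weights, expert_idx, balance_loss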

Google’s 2020 paper, “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding,” introduced two additional means of load balancing:

  • Random routing: While the “top” expert in their top-2 setup is selected using the standard softmax function, the second expert is chosen semi-randomly, with the probability of any given expert being picked proportional to its routing weight. The second-highest-ranked expert is thus the most likely to be selected, but is no longer guaranteed to be selected.

  • Expert capacity: The authors set a threshold that defines the maximum number of tokens that can be processed by any one expert. If either of the top-2 chosen experts is at capacity, the token is deemed “overflowed” and skips ahead to the following layer of the network (see the sketch below).7
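
The following sketch shows a simplified, top-1 version of the expert-capacity idea: each expert accepts at most a fixed number of tokens per batch, and tokens routed to a full expert overflow and are simply passed through to the next layer. GShard’s actual implementation, which also handles the semi-random second expert, is more involved; the names and structure here are purely illustrative.

import torch

def assign_with_capacity(expert_idx, n_experts, capacity):
    """expert_idx: (n_tokens,) chosen expert for each token; returns a keep/overflow mask."""
    counts = torch.zeros(n_experts, dtype=torch.long)
    keep = torch.zeros(expert_idx.numel(), dtype=torch.bool)
    for t in range(expert_idx.numel()):
        e = int(expert_idx[t])
        if counts[e] < capacity:     # expert e still has room for this token
            counts[e] += 1
            keep[t] = True           # token t is processed by expert e
        # otherwise the token "overflows" and skips this MoE layer entirely
    return keep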

Fine-tuning MoE models

As mentioned earlier, the advantages of sparse models are sometimes tempered by their added complexity. The challenges of implementing MoEs are particularly evident in the fine-tuning process. Sparse models are more prone to overfitting than traditional dense models, and the presence of both sparse MoE layers and dense FFN layers complicates a one-size-fits-all approach.

A number of observations and approaches have been proposed to reduce instability when fine-tuning MoEs. The authors of the Switch Transformers paper observed that variants with fewer experts enjoyed more successful fine-tuning, which suggests that the benefits of a larger number of experts in pre-training may be countered by the hindrance it poses to specialization on downstream tasks.

In the 2022 paper “ST-MoE: Designing Stable and Transferable Sparse Expert Models,” Zoph et al. compared the results of 5 different approaches: fine-tuning all parameters (“All”), only non-MoE parameters (“Non MoE”), only MoE parameters (“MoE”), only the self-attention and encoder-decoder attention parameters (“Attention”), and only the non-MoE FFN parameters (“FFN”).

  • Almost no difference was found between the All and Non-MoE approaches.

  • Fine-tuning only Attention parameters resulted in a minor decrease in performance.

  • Updating only the MoE parameters significantly degraded model performance, despite the fact that roughly 80% of model parameters resided in the sparse MoE layers.

  • FFN was the only approach that improved performance relative to the All baseline. 

The authors hypothesized that because expert layers represented only a quarter of their model’s total layers, and a token will see at most only two experts per layer, isolating MoE parameters results in less comprehensive weight updates (and thus greater overfitting and training loss).8
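
In practice, choosing one of these fine-tuning strategies amounts to deciding which parameter groups receive gradient updates. The sketch below shows one simple way to do this in PyTorch; it assumes a hypothetical naming convention in which non-MoE feed-forward weights contain "ffn" in their parameter names and expert weights contain "experts":

def freeze_all_but(model, keyword):
    """Update only the parameters whose names contain `keyword`; freeze everything else."""
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name

# e.g., the FFN-only strategy that performed best in the comparison above:
# freeze_all_but(model, keyword="ffn")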


Instruction tuning mixtures of experts

A July 2023 paper, “Mixture-of-Experts Meets Instruction Tuning,” explored the impact of instruction tuning on MoE models, using equivalents of Google’s T5 LLM and of Flan-T5 (a version of T5 instruction-tuned with Google’s Flan protocol) as baselines. Their experiment compared four setups: fine-tuning a dense T5 model, fine-tuning a dense Flan-T5 model, fine-tuning an MoE model and fine-tuning an instruction-tuned Flan-MoE model.

As expected, the dense T5 equivalent outperformed the MoE model after standard fine-tuning. However, the fine-tuned Flan-MoE model significantly outperformed the fine-tuned Flan-T5 model. Furthermore, the improvement of Flan-MoE over the plain MoE model was even greater than the improvement of Flan-T5 over the original T5.9

Encouragingly, this suggests that despite their difficulties with standard fine-tuning, MoE models actually benefit more from instruction tuning than their dense counterparts. This finding was borne out by the acclaimed release of Mixtral 8x7B Instruct, an instruction-tuned variant of Mixtral that is offered as a foundation model in IBM watsonx.ai™.

 
