
Mixture of Experts – Scaling Transformers Without Breaking the FLOPS Bank

Mixture of Experts (MoE) lets you scale transformer models to billions of parameters without proportional compute costs. By selectively routing tokens through specialized experts, MoE achieves massive parameter capacity while keeping inference FLOPS manageable, a crucial technique for modern LLM architecture.


Transformers have revolutionized AI, but their insatiable appetite for compute poses a significant challenge. As models grow to hundreds of billions of parameters, the computational cost of training and inference becomes prohibitive. This is where Mixture of Experts (MoE) comes in: a clever architectural innovation that lets us scale model capacity without a corresponding explosion in FLOPS. By selectively activating only a small subset of specialized “experts” for each input token, MoE achieves the best of both worlds: massive parameter counts for improved performance, and manageable compute costs for practical deployment.

In this post, we’ll explore how MoE works, why it matters, and how different gating mechanisms decide which expert gets to do the work.

Requirements & Prerequisites

Before digging in, you should be comfortable with Transformer internals: self-attention, feed-forward networks (FFNs), and the role of dense layers.

If these topics are new to you, I recommend checking out my previous posts (KV Cache, Attention-scores) or introductory materials on Transformers before tackling MoE.

Feed-Forward Networks in the Transformer


In a standard Transformer (e.g., BERT, GPT), each layer consists of at least:

  1. Multi-Head Attention (MHA): Computes attention scores between tokens.
  2. Feed-Forward Network (FFN): A dense neural network applied independently to each token.

The FFN is typically a two-layer fully connected network with a non-linearity (e.g., ReLU or GELU) in between. For example, in a Transformer with a hidden size of $ d $ and an intermediate size of $ 4d $, the FFN can be represented as:

\[\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 \cdot x + b_1) + b_2\]

Where:

  • $ x $ is the output from the MHA
  • $ W_1 $ is a weight matrix of size $ 4d \times d $,
  • $ W_2 $ is a weight matrix of size $ d \times 4d $,
  • $ b_1 $ and $ b_2 $ are bias terms.
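
For readers who prefer code, here is a minimal PyTorch sketch of this block; the class name and the $4d$ expansion factor simply follow the description above, this is not a reference implementation.

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: d -> 4d -> d, applied independently to every token."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, 4 * d_model)  # W_1 and b_1
        self.w2 = nn.Linear(4 * d_model, d_model)  # W_2 and b_2
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch_size, seq_len, d_model], the output of multi-head attention
        return self.w2(self.act(self.w1(x)))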

Mixture of Experts: A Unified Explanation

The Core Idea

In a standard Transformer, every token passes through every feed-forward network (FFN) layer. For example, in a dense Transformer with a 1-billion-parameter FFN, every token must be processed by all 1 billion parameters during inference. This becomes computationally expensive as models grow larger.

Mixture of Experts (MoE) addresses this by dividing the FFN into multiple smaller, independent “experts.” Each expert is typically an independent feed-forward network, often consisting of two linear layers with a non-linear activation in between. Instead of processing every token through a single monolithic FFN, a gating mechanism routes each token to a small subset of these experts. This allows the model to have a much larger total parameter count while keeping the computational cost per token manageable.

Figure: Replacing the dense FFN with an MoE layer.

Why Focus on the FFN?

The FFN is the primary target for MoE for three key reasons:

  • It’s the Most Parameter-Intensive Component: In a standard Transformer layer with the usual $4d$ expansion, the FFN accounts for roughly two-thirds of the parameters. Replacing it with MoE allows for massive scaling without a proportional increase in compute.

  • Token-Independent Computation: The FFN operates independently on each token, making it a natural candidate for expert specialization. Each expert can learn to process specific types of tokens or patterns.

  • Efficient Routing: The gating mechanism can route tokens to the most relevant experts, ensuring that the model leverages its full capacity efficiently.

For example, in GPT-3 175B each FFN layer has roughly 1.2 billion parameters.

Mathematical Formulation

For an input token $x$ and a set of experts $E_1, E_2, \dots, E_n$, the gating function $g(x)$ computes a set of weights $\{g_1(x), g_2(x), \dots, g_n(x)\}$. The output of the MoE layer is then:

\[\text{MOE}(x) = \sum_{i=1}^{n} g_i(x) \, E_i(x)\]

In most implementations, only the Top-1 or Top-2 experts (those with the highest gating scores) are activated for each token. This sparsity drastically reduces the computational cost while maintaining high model capacity.
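
As a concrete illustration, here is a minimal (unoptimized) PyTorch sketch of this formula. The class name SimpleMoE and the two-layer shape of each expert are assumptions made for the example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Minimal top-k MoE layer: a gate scores experts, only the top-k run per token."""
    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model] (tokens already flattened across batch and sequence)
        logits = self.gate(x)                                    # [num_tokens, num_experts]
        top_vals, top_idx = torch.topk(logits, self.top_k, -1)   # scores/indices of the chosen experts
        weights = F.softmax(top_vals, dim=-1)                    # g_i(x), normalized over the top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e                     # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out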

Example: MoE vs. Dense FFN

Let’s say we have a Transformer with an MoE layer consisting of 64 experts, each with 100 million parameters. For simplicity, assume the gating mechanism activates 2 experts per token (Top-2 gating); for example, the figure below shows experts 1 and 5 being selected.

Parameter Count

Figure: Top-2 gating selecting experts 1 and 5 out of 64 experts.

  • Total parameters in the MoE layer

    $64 \text{ experts} \times 100 \text{ million parameters/expert} = 6.4 \text{ billion parameters}$

  • Since only 2 experts are activated per token, the number of active parameters is:

    $2 \text{ experts} \times 100 \text{ million parameters/expert} = 200 \text{ million parameters}$

Computational Cost

  • In a dense FFN with 6.4 billion parameters, every token would need to be processed by all 6.4 billion parameters.
  • In the MoE layer, each token is only processed by 200 million parameters, which is ~3.125% of the total parameters.

FLOPS Comparison

  • Dense FFN:
    If processing one token through a single 100-million-parameter expert requires $ C $ FLOPS, then a dense FFN holding all 6.4 billion parameters requires $ 64 \times C $ FLOPS per token.

  • MoE FFN:
    With Top-2 gating, the same token only requires $ 2 \times C $ FLOPS (since only 2 experts are active).

This means the MoE layer incurs only $2/64 \approx 3.125\%$ of the computational cost of a dense FFN with the same total parameter count.
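
A quick back-of-the-envelope check of these numbers in plain Python; the “~2 FLOPs per parameter per token” rule of thumb is an assumption for matmul-dominated layers.

num_experts, top_k = 64, 2
params_per_expert = 100e6                        # 100M parameters per expert

total_params = num_experts * params_per_expert   # 6.4B parameters stored
active_params = top_k * params_per_expert        # 200M parameters used per token

# Rough rule of thumb: ~2 FLOPs per parameter per token for a dense matmul
dense_flops_per_token = 2 * total_params
moe_flops_per_token = 2 * active_params

print(f"Active fraction: {active_params / total_params:.4%}")   # 3.1250%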

The Memory Paradox: Total vs. Active Parameters

When discussing MoE models, there’s a crucial distinction that’s often overlooked: the difference between total parameters and active parameters.

Total Parameters

This is the complete set of weights stored in the model, which can be enormous:

\[\text{Total Parameters} = \text{Shared Parameters} + \text{Number of Experts} \times \text{Parameters per Expert}\]

For example, Mixtral 8x7B has 47B total parameters, but only ~12.9B are used for any given input.

Active Parameters

These are the parameters actually used during a forward pass:

\[\text{Active Parameters} = \text{Shared Parameters} + \text{Top-K} \times \text{Parameters per Expert}\]

This is the number that matters for inference speed and FLOPS.
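
These two formulas are easy to turn into a small helper. The Mixtral-like numbers below (shared attention/embedding weights plus per-expert FFN weights) are rough illustrative assumptions, not exact published figures.

def moe_param_counts(shared: float, num_experts: int, params_per_expert: float, top_k: int):
    total = shared + num_experts * params_per_expert
    active = shared + top_k * params_per_expert
    return total, active

# Roughly Mixtral-shaped: ~2B shared parameters, 8 experts of ~5.6B each, top-2 routing
total, active = moe_param_counts(shared=2e9, num_experts=8, params_per_expert=5.6e9, top_k=2)
print(f"total ≈ {total / 1e9:.1f}B, active ≈ {active / 1e9:.1f}B")  # ≈ 46.8B and 13.2B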

Memory Footprint

The total parameters must be stored in some form of memory, which creates several important considerations:

  1. RAM/VRAM Requirements: The entire model must fit in memory during both training and inference.
  2. Memory Bandwidth: Even if only a fraction of parameters are active, the total memory capacity must be sufficient.
  3. Distributed Training: Models like Mixtral and DBRX often require data parallelism and model parallelism during training.

Gating Mechanisms: The Decision-Makers

Here’s where things get spicy. Gating decides which experts process each token. Different gating strategies exist, each with trade-offs in efficiency, balance, and complexity.

1. Top-K Gating

The most common expert selection strategy. For each token, you compute scores and activate the K highest scoring experts.

  • Top-1 Gating (e.g., Switch Transformers):
    Each token uses exactly one expert. Max sparsity, minimal compute cost, but fragile if an expert overloads.

  • Top-2 Gating (e.g., GLaM, Mixtral):
    Each token uses two experts, offering better robustness and representation, at the cost of doubling compute compared to Top-1.

Implementation Sketch

scores = gate_network(input_tokens)  # [batch_size, seq_len, num_experts]
top_k_scores, top_k_indices = torch.topk(scores, k=K, dim=-1)
normalized_scores = torch.softmax(top_k_scores, dim=-1)  # normalize only the selected experts

2. Noisy Top-K Gating

Introduced in the original Sparsely-Gated MoE paper (Shazeer et al.). It adds Gaussian noise to the gating logits before choosing top-K experts.

noise = torch.randn_like(scores) * noise_epsilon
noisy_scores = scores + noise

Why

  • Increases diversity in expert selection
  • Improves load balancing (experts won’t hog all tokens)
  • Prevents expert collapse during training (where only a few experts get all the work)

3. Hash-Based Gating

Forget learned routing! Hash-based gating uses deterministic functions to assign tokens to experts.

  • Positional hashing:
    Uses modulo hashing on the token position for a deterministic assignment.

  • Hash Layers (Roller et al.):
    Uses content-based hashing on the token itself (a minimal sketch follows the pros and cons below).

Pros

  • Zero learned routing overhead
  • Scales well for distributed systems

Cons

  • No adaptive routing, potentially underutilizing capacity.
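
Here is a minimal sketch of content-based hash routing, assuming we route on token ids; the exact hash function differs between implementations.

import hashlib

def hash_route(token_id: int, num_experts: int) -> int:
    # Deterministic, parameter-free routing: the same token id always
    # lands on the same expert, so no gate network has to be trained.
    digest = hashlib.sha256(str(token_id).encode()).digest()
    return int.from_bytes(digest[:8], "little") % num_experts

print(hash_route(token_id=42017, num_experts=64))  # always the same expert for this id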

4. Soft Gating

No hard decisions. Instead, every expert gets to process every token, with soft weights determining contribution.

  • Dense MoE: All experts process all tokens.
  • Sparse in the Limit: Start soft (weighted mixture), then anneal towards sparse/hard selection during training.

Trade-offs

  • Expensive (high FLOPS), but
  • Easier to optimize early in training.
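
A minimal sketch of the fully dense variant, where every expert runs on every token; gate and experts are assumed to be a linear layer and a list of FFN modules as in the earlier examples.

import torch
import torch.nn.functional as F

def soft_moe_forward(x, gate, experts):
    # x: [num_tokens, d_model]; every expert processes every token.
    weights = F.softmax(gate(x), dim=-1)                     # [num_tokens, num_experts]
    outputs = torch.stack([e(x) for e in experts], dim=1)    # [num_tokens, num_experts, d_model]
    return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # soft-weighted mixture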

5. Hierarchical Gating

Experts are arranged into trees or groups, enabling multi-level decision-making (a minimal two-level sketch follows the lists below).

  • First level gate: Picks a group
  • Second level gate: Picks an expert within the group

Why

  • Scales better when you have hundreds or thousands of experts
  • Reduces routing overhead
  • Common in very large MoE architectures
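
A minimal two-level routing sketch; group_gate and expert_gates are assumed to be small linear layers (one per group), purely for illustration.

import torch
import torch.nn.functional as F

def hierarchical_route(x, group_gate, expert_gates, top_k=1):
    # x: [num_tokens, d_model]
    # Level 1: pick a group of experts for each token.
    group_idx = F.softmax(group_gate(x), dim=-1).argmax(dim=-1)        # [num_tokens]
    # Level 2: within the chosen group, pick the top-k experts.
    expert_idx = torch.empty(x.size(0), top_k, dtype=torch.long)
    for g, gate in enumerate(expert_gates):                            # one small gate per group
        mask = group_idx == g
        if mask.any():
            expert_idx[mask] = gate(x[mask]).topk(top_k, dim=-1).indices
    return group_idx, expert_idx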

6. Dynamic Capacity Gating

Introduced in GO-MoE (Group-wise Orthogonal MoE).

  • Dynamically adjusts the capacity factor (number of tokens per expert)
  • Popular experts can process more tokens, balancing workloads better
  • Maintains fixed compute budgets across devices

7. Expert-Choice Routing

Flips the routing logic on its head:
Instead of tokens choosing experts, each expert selects the tokens it is most confident about, up to a fixed capacity, which naturally keeps the load balanced (see the sketch below).

  • Particularly useful in distributed settings where experts live on different devices.
  • Reduces communication overhead during multi-device training (critical when scaling horizontally).
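
A minimal sketch of the idea; the softmax axis and scoring details vary between papers, and gate is again an assumed linear layer.

import torch
import torch.nn.functional as F

def expert_choice_route(x, gate, capacity: int):
    # x: [num_tokens, d_model]; scores per (expert, token) pair
    scores = F.softmax(gate(x), dim=-1).t()                 # [num_experts, num_tokens]
    # Each expert keeps the `capacity` tokens it scores highest, so every
    # expert processes exactly the same number of tokens (perfect balance).
    top_scores, token_idx = torch.topk(scores, k=capacity, dim=-1)
    return top_scores, token_idx                            # both [num_experts, capacity]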

8. Balanced Assignment Gating

Used in frameworks like DeepSpeed-MoE.

  • Explicitly enforces balance in expert assignment
  • Uses optimal transport algorithms to assign tokens efficiently
  • Minimizes total routing “cost” while ensuring no expert gets overloaded

Caveat

  • Expensive routing calculation—there’s an additional compute overhead during routing itself.

Handling Load Balancing

A significant challenge with MoE is ensuring that all experts are utilized evenly. Without proper load balancing, the gating mechanism might favor a few experts, underutilizing the rest. Common solutions include:

  • Auxiliary Loss Functions: These losses encourage the gate to distribute tokens more evenly across experts.
  • Dynamic Bias Adjustments: Some approaches adjust expert biases during training so that underused experts gradually receive a higher probability.

Both techniques aim to prevent expert redundancy and ensure each expert can develop its unique specialization.
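
As an example of the first technique, here is a sketch of the Switch-Transformer-style auxiliary loss: the product of the fraction of tokens routed to each expert and the mean router probability per expert. The scaling coefficient applied to this loss is left to the caller.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts):
    # router_probs:   [num_tokens, num_experts], softmax output of the gate
    # expert_indices: [num_tokens], top-1 expert chosen for each token
    one_hot = F.one_hot(expert_indices, num_experts).float()
    tokens_per_expert = one_hot.mean(dim=0)       # f_i: fraction of tokens sent to expert i
    mean_router_prob = router_probs.mean(dim=0)   # P_i: average probability mass on expert i
    # Minimized when both distributions are uniform across experts
    return num_experts * torch.sum(tokens_per_expert * mean_router_prob)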

Mixtral: A Case Study in MoE

Mixtral 8x7B, developed by Mistral AI, has become a reference implementation for MoE in large language models. Understanding its notation helps grasp how MoE is implemented in practice.

Mixtral’s Architecture

The “8x7B” in Mixtral’s name refers to:

  • 8 experts per MoE layer
  • 7B parameters per expert

However, this is slightly misleading since:

  1. Only the FFN layers are MoE-based
  2. The attention layers are dense and shared across all routing paths

Formal Notation

In Mixtral’s implementation, the MoE-based FFN is defined as:

\[\text{MoE-FFN}(x) = \sum_{i=1}^{2} g_i(x) \cdot \text{FFN}_i(x)\]

Where:

  • $g_i(x)$ are the normalized router scores for the top-2 experts
  • $\text{FFN}_i(x)$ are the outputs from the selected experts

Routing and Expert Selection

Mixtral uses a top-2 router which:

  1. Computes scores for all 8 experts using a linear projection
  2. Selects the highest 2 scores
  3. Normalizes these scores with softmax
  4. Computes a weighted sum of the expert outputs

In pseudocode:

import torch
import torch.nn.functional as F

def mixtral_router(x, router_weights, top_k=2):
    # x: [batch_size, seq_len, hidden_size]
    # router_weights: [hidden_size, num_experts]
    router_logits = x @ router_weights                # [batch_size, seq_len, num_experts]

    # Softmax over all experts, then keep the top-k
    router_probs = F.softmax(router_logits, dim=-1)
    top_k_probs, top_k_indices = torch.topk(router_probs, top_k, dim=-1)

    # Renormalize so the selected experts' weights sum to 1
    top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

    return top_k_probs, top_k_indices
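
A quick, hypothetical check with random tensors, using the hidden size and expert count Mixtral uses:

x = torch.randn(1, 10, 4096)              # [batch, seq_len, hidden_size]
router_weights = torch.randn(4096, 8)     # 8 experts
probs, indices = mixtral_router(x, router_weights, top_k=2)
print(probs.shape, indices.shape)         # torch.Size([1, 10, 2]) torch.Size([1, 10, 2])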

Memory Implications

For Mixtral 8x7B:

  • Total parameters: ~47B
  • Active parameters: ~12.9B

This means that while the model requires storage for all 47B parameters, computationally it behaves like a ~13B parameter model during inference.

MoE Memory Management Strategies

Handling the memory overhead of MoE models requires careful consideration, especially for deployment:

1. Offloading Techniques

  • CPU Offloading: Store inactive experts in CPU memory
  • Disk Offloading: For extremely large models, offload to SSD/NVMe
  • Expert Swapping: Dynamically load/unload experts based on routing decisions

2. Sharded Deployment

  • Tensor Parallelism: Split each expert’s weight matrices across multiple GPUs
  • Expert Parallelism: Each device handles a subset of experts
  • Hybrid Approaches: Combine expert and tensor parallelism for optimal utilization

3. Inference Optimization

  • Batched Evaluation: Process multiple tokens through the same expert simultaneously
  • Expert Pruning: Identify and remove redundant experts post-training

For example, to run Mixtral efficiently on consumer hardware:

# Sketch of efficient Mixtral inference with Hugging Face Transformers + Accelerate
from transformers import MixtralForCausalLM

model = MixtralForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    device_map="auto",        # let Accelerate place layers across GPU/CPU automatically
    offload_folder="offload"  # spill weights that do not fit in memory to disk
)

Optimization Pathways for MoE

Combining MoE with additional techniques further enhances efficiency:

  • Sparse Upcycling: Converting dense models into MoE versions to leverage pre-trained knowledge.
  • Advanced Gating Architectures: Experimenting with gating designs that incorporate dynamic bias adjustments, auxiliary load-balancing losses, or even reinforcement learning-based updates.
  • Hybrid Architectures: Integrating MoE layers with other optimization techniques (like KV caching and FlashAttention) to achieve both parameter scalability and compute efficiency.

Conclusion

Mixture of Experts isn’t just a way to cram more parameters into your model; it’s an architectural choice that changes how we think about capacity vs. compute trade-offs. By intelligently routing tokens through a vast pool of specialized experts, MoE unlocks higher capacity with minimal additional computational cost.

From Top-1 gating simplicity to advanced balanced assignment strategies, the landscape of MoE routing is as rich as it is effective. If you’re building large-scale LLMs, MoE can get you to trillion-parameter territory without breaking the FLOPS bank. Just remember: routing is half the battle.

As the field continues to refine gating strategies and address load-balancing challenges, MoE is poised to become a cornerstone in the next generation of deep learning models.

Further reading and references

  • The GPT-3 paper (see Table 2.1 for inferring the parameter count of an FFN)
  • An excellent blog post on MoEs from Hugging Face
  • Another great blog post from one of the authors of Hands-On Large Language Models (a book co-authored by Jay Alammar)

This post is licensed under CC BY 4.0 by the author.