Modern Techniques

Beyond the basics - 5 chapters

From Mixture of Experts and test-time reasoning to tool use, multimodal vision, and long-context attention - the techniques pushing LLMs forward.

Sparse activation, dense knowledge

MoE (Mixture of Experts)

Models like Mixtral and DeepSeek-V2 don't activate all their parameters for every token. A lightweight router network scores each token against N expert FFN blocks (e.g., 8) and selects the top-k (e.g., 2). The model keeps its full parameter count, but each token pays for only a fraction of the expert compute.

Each expert is a standard feed-forward network. The router is a small linear layer followed by a softmax, which produces a probability distribution over experts. Only the top-k experts (typically k=2) are activated per token, and their outputs are summed using the router weights as coefficients. Different tokens route to different experts, and experts tend to specialize - some in code, others in math or natural language. A load-balancing auxiliary loss discourages expert collapse, where nearly all tokens route to the same expert and the others go unused.
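The routing step can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the implementation used by Mixtral or DeepSeek-V2. The names `moe_forward`, `make_expert`, and `load_balance_loss` are hypothetical, the experts are plain two-layer ReLU FFNs, and the top-k router probabilities are renormalized to sum to 1 before combining. The auxiliary loss shown is one common formulation (fraction of assignments times mean router probability, summed over experts and scaled by the expert count); the text above does not say which form is used.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def make_expert(rng, d_model, d_hidden):
    # Hypothetical expert: a two-layer ReLU feed-forward network.
    w1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
    w2 = rng.normal(scale=0.1, size=(d_hidden, d_model))
    return lambda h: np.maximum(h @ w1, 0.0) @ w2

def moe_forward(x, w_router, experts, k=2):
    """x: (tokens, d_model); w_router: (d_model, n_experts)."""
    probs = softmax(x @ w_router)                        # (tokens, n_experts)
    topk_idx = np.argsort(-probs, axis=-1)[:, :k]        # k best experts per token
    topk_p = np.take_along_axis(probs, topk_idx, axis=-1)
    # Assumption: renormalize the selected probabilities so they sum to 1.
    gates = topk_p / topk_p.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        # Rows (tokens) that picked expert e, and which of their k slots it fills.
        token_ids, slot = np.nonzero(topk_idx == e)
        if token_ids.size == 0:
            continue  # this expert does no work for this batch
        out[token_ids] += gates[token_ids, slot][:, None] * expert(x[token_ids])
    return out, probs, topk_idx

def load_balance_loss(probs, topk_idx):
    # f[e]: share of all token-to-expert assignments that went to expert e.
    # p[e]: mean router probability for expert e.
    n_experts = probs.shape[1]
    f = np.bincount(topk_idx.ravel(), minlength=n_experts) / topk_idx.size
    p = probs.mean(axis=0)
    # Equals 1 for perfectly uniform routing, n_experts for total collapse.
    return n_experts * float(np.sum(f * p))

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
experts = [make_expert(rng, d_model, 32) for _ in range(n_experts)]
w_router = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=(5, d_model))                        # 5 random "tokens"
y, probs, topk_idx = moe_forward(x, w_router, experts, k=2)
print(y.shape, topk_idx.shape, load_balance_loss(probs, topk_idx))
```

The loop runs each expert only on the tokens that selected it, which is where the compute saving comes from: an expert that no token picked does no work at all.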

MoE Router - token routes to top-2 of 8 experts

Input token: "def"
Active experts (top-k): 2 (selectable: k=1, k=2, k=3, k=4)

Router (linear + softmax) scores for "def":

Expert 0 (Code): 82%
Expert 1 (Math): 12%
Expert 2 (Language)
Expert 3 (Reasoning)
Expert 4 (Facts)
Expert 5 (Creative)
Expert 6 (Science)
Expert 7 (Dialog)

Compute per token

Dense model (all 8 experts): 100% FLOPs
MoE (top-2 of 8): 25% FLOPs

Key insight: Different tokens naturally route to different experts. "def" routes to the Code expert, "gravity" to Science, "therefore" to Reasoning. The router learns these specializations during training. With top-2 of 8 experts, the layer holds 8x the parameters of a single expert FFN but spends only 2x that expert's compute per token, which is 25% of what running all 8 experts would cost.
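The parameter and compute figures can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: `moe_ffn_cost` is a hypothetical helper, each expert is assumed to be a two-matrix FFN with biases ignored, a multiply-add is counted as 2 FLOPs per weight per token, and attention and other non-expert layers are left out. The layer sizes in the usage lines are arbitrary example values.

```python
def moe_ffn_cost(d_model, d_hidden, n_experts=8, k=2):
    """Rough per-token cost of one MoE FFN layer (assumptions in the text above)."""
    expert_params = 2 * d_model * d_hidden   # up-projection + down-projection
    router_params = d_model * n_experts      # single linear layer
    flops_per_weight = 2                     # one multiply and one add
    active_flops = flops_per_weight * (k * expert_params + router_params)
    all_experts_flops = flops_per_weight * n_experts * expert_params
    return {
        "total_params": n_experts * expert_params + router_params,
        "params_vs_one_expert": n_experts,   # 8x the expert weights stored
        "flops_vs_one_expert": k,            # router cost excluded here
        "active_fraction": active_flops / all_experts_flops,
    }

cost = moe_ffn_cost(d_model=4096, d_hidden=14336, n_experts=8, k=2)
print(cost["params_vs_one_expert"], cost["flops_vs_one_expert"])
print(round(cost["active_fraction"], 4))
```

Under these assumptions the router adds only a tiny overhead, so the active fraction stays very close to k / n_experts.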
