From Mixture of Experts and test-time reasoning to tool use, multimodal vision, and long-context attention - the techniques pushing LLMs forward.
01 Sparse activation, dense knowledge
MoE (Mixture of Experts)
Models like Mixtral and DeepSeek-V2 don't activate all their parameters for every token. A lightweight router network scores each token against N expert FFN blocks (e.g., 8) and selects the top-k (e.g., 2). The model keeps its full parameter count but spends only a fraction of the compute on each token.
Each expert is a standard feed-forward network. The router is a small linear layer followed by a softmax, which produces a probability distribution over experts. Only the top-k experts (typically k=2) are activated per token, and their outputs are combined using the router weights as coefficients. Different tokens naturally route to different experts: some specialize in code, others in math or natural language. A load-balancing auxiliary loss prevents expert collapse, where all tokens route to the same expert.
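The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `router_probs`, `top_k`, `make_expert`, `moe_layer`, and `load_balance_loss` are hypothetical, the weights are random, renormalizing the selected router weights so they sum to 1 is an assumption, and the auxiliary loss follows one common Switch-Transformer-style form (number of experts times the sum over experts of routed share times mean router probability), which may differ from what a given model uses.

```python
import math
import random


def router_probs(token, router_weights):
    """Router: one linear layer (one weight row per expert) followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Pick the k highest-scoring experts and their mixing coefficients."""
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize the selected probabilities so they sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def make_expert(rng, d_model, d_hidden):
    """A toy feed-forward expert: linear -> ReLU -> linear, random weights."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def ffn(x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return ffn


def moe_layer(token, router_weights, experts, k=2):
    """Run only the top-k experts and sum their outputs, weighted by the router."""
    out = [0.0] * len(token)
    for i, coeff in top_k(router_probs(token, router_weights), k):
        out = [o + coeff * e for o, e in zip(out, experts[i](token))]
    return out


def load_balance_loss(all_probs, all_picks, n_experts):
    """Assumed Switch-style auxiliary loss: n_experts * sum_i f_i * p_i."""
    n_assignments = sum(len(picks) for picks in all_picks)
    share = [0.0] * n_experts  # f_i: share of routing assignments sent to expert i
    for picks in all_picks:
        for i, _ in picks:
            share[i] += 1.0 / n_assignments
    # p_i: mean router probability for expert i across the batch of tokens
    mean_prob = [sum(p[i] for p in all_probs) / len(all_probs) for i in range(n_experts)]
    return n_experts * sum(f * p for f, p in zip(share, mean_prob))


# Toy setup: 8 experts, a 4-dimensional "token", top-2 routing.
rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 8
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]
experts = [make_expert(rng, d_model, d_hidden) for _ in range(n_experts)]
token = [0.5, -1.0, 0.25, 2.0]

print(top_k(router_probs(token, router_weights), k=2))
print(moe_layer(token, router_weights, experts, k=2))
```

Under this form of the loss, perfectly uniform routing scores 1.0 and full collapse onto a single expert scores the number of experts, so minimizing it pushes the router toward spreading tokens out.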
Figure: MoE Router - token routes to top-2 of 8 experts
Router (linear + softmax) scores for the input token "def":
Expert 0 (Code): 82% (#1, active)
Expert 1 (Math): 12% (#2, active)
Expert 2 (Language): 1%
Expert 3 (Reasoning): 2%
Expert 4 (Facts): 1%
Expert 5 (Creative): 0%
Expert 6 (Science): 1%
Expert 7 (Dialog): 1%
Compute per token:
Dense model (all 8 experts): 100% FLOPs
MoE (top-2 of 8): 25% FLOPs
Key insight: Different tokens naturally route to different experts. "def" routes to the Code expert, "gravity" to Science, "therefore" to Reasoning. The router learns these specializations during training. With top-2 of 8 experts, you get 8x the expert parameters of a single feed-forward block for only 2x its per-token compute.
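The parameter-versus-compute trade-off can be checked with simple arithmetic. The sketch below counts only the expert feed-forward blocks and ignores the router, attention, and embeddings. The helper name `moe_ffn_cost` and the figure of one million parameters per expert are made up for illustration.

```python
def moe_ffn_cost(n_experts, k, params_per_expert):
    """Return (stored params, params used per token, active fraction) for the expert FFNs."""
    total = n_experts * params_per_expert   # every expert is stored in memory
    active = k * params_per_expert          # only the top-k experts run per token
    return total, active, active / total


# Hypothetical size: 1,000,000 parameters per expert, top-2 of 8 routing.
total, active, fraction = moe_ffn_cost(n_experts=8, k=2, params_per_expert=1_000_000)
print(total, active, fraction)  # 8000000 2000000 0.25
```

Per-token FLOPs in the expert layers scale roughly with the active parameters, which is where the 25% figure for top-2 of 8 comes from.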