Modern Techniques

Beyond the basics - 5 chapters

From Mixture of Experts and test-time reasoning to tool use, multimodal vision, and long-context attention - the techniques pushing LLMs forward.

01
Sparse activation, dense knowledge

MoE (Mixture of Experts)

Models like Mixtral and DeepSeek-V2 don't activate all their parameters for every token. A lightweight router network scores each token against N expert FFN blocks (e.g., 8) and selects the top-k (e.g., 2). The model keeps its full parameter count, but each token pays for only a fraction of the compute.

Each expert is a standard feed-forward network. The router is a small linear layer followed by a softmax, which produces a probability distribution over experts. Only the top-k experts (typically k=2) are activated per token, and their outputs are combined using the router weights as coefficients. Different tokens naturally route to different experts - some specialize in code, others in math or natural language. A load-balancing auxiliary loss, added during training, prevents expert collapse, where all tokens route to the same expert.
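The routing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the tiny dimensions, the random weights, the ReLU expert, and the helper names (`make_expert`, `moe_forward`, `load_balance_loss`) are all assumptions made for the example. Rescaling the top-k weights so they sum to 1 is also an assumption, and the auxiliary loss follows one common formulation (in the style of the Switch Transformer), which may differ from what a given model uses.

```python
import math
import random

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def make_expert(rng, d_model, d_ff):
    """Toy expert FFN: d_model -> d_ff (ReLU) -> d_model. Sizes are hypothetical."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)
    return expert

def moe_forward(x, router_weights, experts, k=2):
    """Route one token vector x to its top-k experts and mix their outputs."""
    probs = softmax(matvec(router_weights, x))   # distribution over experts
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)            # assumption: rescale top-k to sum to 1
    out = [0.0] * len(x)
    for i in top:                                # only k experts are ever evaluated
        gate = probs[i] / norm
        out = [o + gate * v for o, v in zip(out, experts[i](x))]
    return out, top, probs

def load_balance_loss(all_probs, all_top, n_experts):
    """Auxiliary loss n_experts * sum_i f_i * p_i (Switch-Transformer style).

    f_i: share of routing assignments that went to expert i.
    p_i: mean router probability for expert i over the batch.
    """
    n_assign = sum(len(t) for t in all_top)
    f = [sum(t.count(i) for t in all_top) / n_assign for i in range(n_experts)]
    p = [sum(pr[i] for pr in all_probs) / len(all_probs) for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

rng = random.Random(0)
D_MODEL, D_FF, N_EXPERTS = 4, 8, 8               # toy sizes, chosen for readability
experts = [make_expert(rng, D_MODEL, D_FF) for _ in range(N_EXPERTS)]
router_weights = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(N_EXPERTS)]
tokens = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(16)]

results = [moe_forward(x, router_weights, experts, k=2) for x in tokens]
aux = load_balance_loss([r[2] for r in results], [r[1] for r in results], N_EXPERTS)
print("experts used by first token:", results[0][1])
print("load-balance loss:", round(aux, 3))
```

With this formulation the loss equals 1.0 when both the assignments and the router probabilities are spread evenly across experts, and it rises to the number of experts when every token is sent to the same one, which is the collapse the penalty is meant to discourage.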

Figure: MoE Router - token routes to top-2 of 8 experts (interactive; active experts selectable from k=1 to k=4, shown here with k=2)

Router (linear + softmax) scores for the input token "def":

  • Expert 0 (Code): 82% - selected, #1
  • Expert 1 (Math): 12% - selected, #2
  • Expert 2 (Language): 1%
  • Expert 3 (Reasoning): 2%
  • Expert 4 (Facts): 1%
  • Expert 5 (Creative): 0%
  • Expert 6 (Science): 1%
  • Expert 7 (Dialog): 1%

Compute per token:

  • Dense model (all 8 experts): 100% FLOPs
  • MoE (top-2 of 8): 25% FLOPs
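As a worked example, the router scores listed above for "def" can be pushed through top-k selection for each setting of k. The sketch below assumes the selected weights are rescaled to sum to 1 before mixing, which is one common choice rather than something the figure specifies; `select_top_k` is a hypothetical helper name.

```python
# Router probabilities for the token "def", copied from the figure's scores.
scores = {"Code": 0.82, "Math": 0.12, "Language": 0.01, "Reasoning": 0.02,
          "Facts": 0.01, "Creative": 0.00, "Science": 0.01, "Dialog": 0.01}

def select_top_k(scores, k):
    """Keep the k highest-scoring experts and rescale their weights to sum to 1."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {name: p / total for name, p in top}

for k in (1, 2, 3, 4):
    gates = select_top_k(scores, k)
    print(k, {name: round(w, 3) for name, w in gates.items()})
```

For k=2 this keeps Code and Math with mixing weights of roughly 0.87 and 0.13 (0.82 / 0.94 and 0.12 / 0.94); the other six experts are never evaluated for this token.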

Key insight: Different tokens naturally route to different experts. "def" routes to the Code expert, "gravity" to Science, "therefore" to Reasoning. The router learns these specializations during training. With top-2 of 8 experts, the layer holds 8x the feed-forward parameters of a single expert but spends only 2x a single expert's compute on each token.
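The parameter-versus-compute trade-off can be checked with simple arithmetic. The sketch below counts weights for one MoE feed-forward layer; the dimensions (`d_model=1024`, `d_ff=4096`) and the two-matrix FFN shape are hypothetical values chosen for illustration, and attention, embeddings, and biases are left out.

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a two-matrix FFN (up- and down-projection), biases ignored."""
    return 2 * d_model * d_ff

def moe_layer_stats(d_model: int, d_ff: int, n_experts: int, k: int) -> dict:
    """Stored vs. per-token-active parameters for one MoE FFN layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * n_experts          # the router is one small linear layer
    total = n_experts * per_expert + router
    active = k * per_expert + router      # only k experts run for a given token
    return {
        "per_expert": per_expert,
        "total": total,
        "active": active,
        "active_fraction": active / total,
    }

stats = moe_layer_stats(d_model=1024, d_ff=4096, n_experts=8, k=2)
print(f"stored params : {stats['total'] / stats['per_expert']:.2f}x one expert")
print(f"active params : {stats['active'] / stats['per_expert']:.2f}x one expert")
print(f"active share  : {stats['active_fraction']:.1%} of the layer")
```

Because matrix-multiply cost scales with the weights actually used, the layer stores about 8x one expert's parameters while each token touches about 2x, or roughly 25% of the layer. The router adds a negligible amount. Across a whole model the ratio is less extreme, since attention and embedding parameters are shared by every token.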


forwardpass.dev

An interactive educational project visualizing how LLM inference, training, and deployment work - from raw text to generated response.
