
The Attention Mechanism

From dot products to hybrid architectures — 8 chapters

A complete deep dive into attention: why it exists, how it works mathematically, and every modern variant from RoPE and GQA to MLA, sparse attention, and hybrid linear architectures.

01. The problem attention solves

Why Attention?

Before transformers, the dominant sequence models were recurrent: they processed tokens one at a time. RNNs accumulated context into a fixed-size hidden state, a bottleneck that could not hold everything in a long sequence. Attention removes this bottleneck: every token looks directly at every other token in a single step.

An RNN needs O(N) sequential operations to process N tokens, which prevents parallelism across the sequence. Gradients must also flow back through every step, so they tend to vanish over long distances. The 'information bottleneck' means that a model reading 512 tokens must compress all of that context into one fixed-size vector before generating. Attention has an O(1) maximum path length between any two positions, which makes long-range dependencies far easier to learn.
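The contrast above can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: the scalar RNN, its weights `w_h` and `w_x`, and the function names `rnn_final_state` and `attention` are all assumptions made for this example. The RNN loop cannot start step t before step t-1 finishes, while each attention output is computed directly from all tokens.

```python
import math

def rnn_final_state(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN (hypothetical weights): h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = 0.0
    steps = 0
    for x in xs:  # inherently sequential: step t needs h from step t-1
        h = math.tanh(w_h * h + w_x * x)
        steps += 1
    return h, steps

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, weight_rows = [], []
    for q in queries:  # each query row is independent, so rows could run in parallel
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        weight_rows.append(weights)
    return outputs, weight_rows

# Only the first input is non-zero; its trace in h shrinks with every later step.
h, steps = rnn_final_state([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# Self-attention: the last token weights the first token directly, in one step.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Here `steps` equals the sequence length, which is the O(N) sequential cost, while `weights[2][0]` is the direct, one-hop weight that the last token places on the first.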

RNN path length: O(N) (long-range)
Attention path length: O(1) (any distance)
Attention compute: O(N²) (per layer)
Figure: RNN, sequential, one token at a time. The tokens "The cat sat on the mat" pass one by one through a hidden state h_t (fixed-size bottleneck).
Token 0 ("The") must survive 5 compression steps to reach the output.
Vanishing gradients. Information lost at distance.
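The vanishing-gradient claim can be made concrete with the chain rule. The bound below is a sketch under one assumption that the text does not state: that the norm of each per-step Jacobian is at most some constant γ.

```latex
\[
\frac{\partial h_T}{\partial h_0}
  = \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\left\lVert \frac{\partial h_T}{\partial h_0} \right\rVert
  \le \prod_{t=1}^{T} \left\lVert \frac{\partial h_t}{\partial h_{t-1}} \right\rVert
  \le \gamma^{T}
\]
```

When γ < 1 the bound shrinks exponentially with distance. For the 5 steps in the example above and an illustrative γ = 0.5, it is 0.5^5 = 0.03125.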
Figure: Attention, all tokens in parallel, O(1) path length. Tokens shown: "The cat sat on the mat".
Every token directly attends to every other in one step.
O(1) path length. Full parallelism. No bottleneck.
The fundamental tradeoff
RNNs: O(N) sequential steps, vanishing gradients, information bottleneck at long range
Attention: O(1) path length between any two tokens, fully parallel, direct gradient flow
Tradeoff: attention costs O(N²) compute and memory in the sequence length. FlashAttention, sparse variants, and KV cache optimizations reduce this cost in practice, but they do not make it free.
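To get a feel for the quadratic term, the following sketch counts the entries in one N × N attention score matrix. The function names and the 2-bytes-per-entry figure (roughly a 16-bit float) are assumptions for illustration. The numbers are for a single head in a single layer if the full matrix is stored.

```python
def attention_score_entries(n_tokens):
    """Number of query-key scores in one full N x N attention matrix."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_entry=2):
    """Memory for one score matrix, assuming 2 bytes per entry (hypothetical)."""
    return attention_score_entries(n_tokens) * bytes_per_entry

for n in (512, 2048, 8192):
    # Doubling the sequence length quadruples the number of scores.
    print(n, attention_score_entries(n), score_matrix_bytes(n))
```

Going from 512 to 8192 tokens is a 16x longer sequence but 256x more scores, which is why long contexts call for the optimizations listed above.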

forwardpass.dev

An interactive educational project visualizing how LLM inference, training, and deployment work - from raw text to generated response.
