The Inference Engine
Serving at scale - 11 chapters
Explore the infrastructure layer that makes LLMs fast and efficient: batching, KV caching, PagedAttention, FlashAttention, speculative decoding, and production serving.
Prefill vs Decode
Inference has two distinct phases. Prefill processes the entire prompt in parallel: it is compute-bound and cheap per token. Decode generates tokens one at a time: it is memory-bandwidth-bound and expensive per token. Everything else in this journey is an attempt to work around the decode bottleneck.
During prefill, all prompt tokens are processed simultaneously through the attention and FFN layers, so each weight loaded from memory is reused across many tokens and the GPU's compute cores stay busy. During decode, each new token requires reading the model weights and the entire KV cache from memory but performs only a small amount of computation per byte read, so the GPU spends most of its time waiting on memory. This is why decode throughput, measured in tokens/second, is limited by memory bandwidth (GB/s) rather than FLOPS. Some systems (like Splitwise and DistServe) run prefill and decode on separate hardware optimized for each phase.
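The compute-bound versus memory-bound split can be checked with rough roofline arithmetic. The sketch below is a simplification, not a profiler: it counts only weight reads (ignoring attention and KV cache traffic), assumes about 2 FLOPs per parameter per token, and uses hypothetical hardware numbers (`PEAK_FLOPS`, `MEM_BANDWIDTH`) and a hypothetical 7B-parameter model. All function and constant names are illustrative.

```python
def arithmetic_intensity(n_params: float, tokens_per_pass: int,
                         bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass."""
    flops = 2 * n_params * tokens_per_pass    # ~2 FLOPs per parameter per token
    bytes_read = n_params * bytes_per_param   # weights are read once per pass
    return flops / bytes_read


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the roofline ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute" if intensity >= ridge else "memory bandwidth"


# Hypothetical numbers, chosen only for illustration.
N_PARAMS = 7e9            # 7B-parameter model (assumption)
PEAK_FLOPS = 1.0e15       # 1000 TFLOP/s peak compute (assumption)
MEM_BANDWIDTH = 2.0e12    # 2 TB/s memory bandwidth (assumption)

prefill = arithmetic_intensity(N_PARAMS, tokens_per_pass=1024)  # 1024-token prompt
decode = arithmetic_intensity(N_PARAMS, tokens_per_pass=1)      # one new token

print("prefill:", prefill, "FLOPs/byte ->", bound_by(prefill, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", decode, "FLOPs/byte ->", bound_by(decode, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions the ridge point is 500 FLOPs per byte. A 1024-token prefill sits above it, while single-token decode sits far below it, which matches the split described above.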
The GPU is a supercomputer that spends most of decode waiting for memory reads. This is why memory bandwidth (GB/s) matters more than FLOPS for token generation, and why batching multiple requests together helps: the weights are read once per decode step and shared by every request in the batch, so the cost of that read is amortized.
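To see why batching helps, here is a back-of-the-envelope upper bound on decode throughput. It assumes each decode step is limited purely by memory traffic: the weights are read once per step, and each request in the batch also reads its own KV cache. The byte counts and bandwidth are hypothetical, and the function name is illustrative.

```python
def decode_tokens_per_second(batch_size: int, weight_bytes: float,
                             kv_bytes_per_request: float,
                             mem_bandwidth: float) -> float:
    """Bandwidth-limited upper bound on decode throughput for one batch."""
    # Weights are read once per step; KV cache reads grow with the batch.
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_request
    seconds_per_step = bytes_per_step / mem_bandwidth
    # Each step emits one token per request in the batch.
    return batch_size / seconds_per_step


# Hypothetical numbers, chosen only for illustration.
WEIGHT_BYTES = 14e9           # 7B parameters at 2 bytes each (assumption)
KV_BYTES_PER_REQUEST = 0.5e9  # KV cache for one request (assumption)
MEM_BANDWIDTH = 2.0e12        # 2 TB/s (assumption)

for batch in (1, 8, 32):
    tps = decode_tokens_per_second(batch, WEIGHT_BYTES,
                                   KV_BYTES_PER_REQUEST, MEM_BANDWIDTH)
    print(f"batch={batch:>2}: at most {tps:,.0f} tokens/s")
```

Total throughput rises with batch size because the weight read is shared, but the gain tapers off as per-request KV cache traffic starts to dominate each step.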