The Inference Engine

Serving at scale - 11 chapters

Explore the infrastructure layer that makes LLMs fast and efficient: batching, KV caching, PagedAttention, FlashAttention, speculative decoding, and production serving.

Two very different phases

Prefill vs Decode

Inference has two distinct phases. Prefill processes the entire prompt in parallel: it is compute-bound and fast per token. Decode generates tokens one at a time: it is memory-bandwidth-bound and slow per token. Most of the techniques in this journey are attempts to work around this fundamental constraint.
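The two phases can be sketched with a toy single-head attention in plain Python. This is an illustration under simplifying assumptions, not how a real engine is written: queries, keys and values are the token embeddings themselves (no learned projections), and the names attend, prefill and decode_step are hypothetical. Prefill handles every prompt position in one pass and stores the keys and values it computed (the KV cache); each decode step then handles a single new token and reads that whole cache.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def prefill(prompt):
    """Process all prompt positions in one pass and build the KV cache.

    Toy assumption: keys and values are the embeddings themselves.
    Each position only looks at earlier positions (causal attention), and
    the positions do not depend on each other's outputs, which is what
    lets a GPU compute them in parallel.
    """
    cache = {"keys": list(prompt), "values": list(prompt)}
    outputs = [attend(prompt[i], prompt[: i + 1], prompt[: i + 1])
               for i in range(len(prompt))]
    return outputs, cache

def decode_step(token, cache):
    """Process one new token: append it to the cache, then read all of it."""
    cache["keys"].append(token)
    cache["values"].append(token)
    return attend(token, cache["keys"], cache["values"])

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

prefill_outputs, kv_cache = prefill(prompt)
decoded = decode_step(new_token, kv_cache)

# Recomputing from scratch over prompt + new token yields the same last row,
# so the cache saves recomputation but must be re-read on every decode step.
recomputed, _ = prefill(prompt + [new_token])
print(decoded, recomputed[-1])
```

The decode step does very little arithmetic (one query against the cache), yet it touches every cached key and value, which is the pattern that makes decode sensitive to memory bandwidth.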

During prefill, all prompt tokens are processed simultaneously through the attention and FFN layers, so the GPU's compute cores are heavily utilized. During decode, each new token requires reading the model weights and the entire KV cache from memory but performs only a tiny amount of computation, so the GPU spends most of its time waiting for memory reads. This is why decode throughput (tokens/second) is limited by memory bandwidth (GB/s), not FLOPS. Some systems, such as Splitwise and DistServe, run prefill and decode on separate hardware optimized for each phase.
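A back-of-the-envelope estimate makes the compute-bound versus memory-bound split concrete. The sketch below takes the time of one forward pass as the larger of its compute time and its memory-read time (a simple roofline-style model). All numbers are assumptions chosen for illustration: a hypothetical 7B-parameter model in 16-bit precision, a hypothetical accelerator with 300 TFLOPS of peak compute and 1000 GB/s of memory bandwidth, a hypothetical 0.25 GB KV cache, and the common approximation of about 2 FLOPs per parameter per token.

```python
def step_time_ms(n_tokens, params=7e9, bytes_per_param=2, kv_cache_bytes=0.0,
                 peak_flops=300e12, mem_bandwidth=1000e9):
    """Estimate one forward pass as max(compute time, memory-read time).

    All defaults are hypothetical illustration values, not measurements.
    """
    # Roughly 2 FLOPs (one multiply, one add) per parameter per token.
    compute_s = 2 * params * n_tokens / peak_flops
    # Weights and KV cache are read once per pass, however many tokens it has.
    memory_s = (params * bytes_per_param + kv_cache_bytes) / mem_bandwidth
    bound = "compute" if compute_s > memory_s else "memory"
    return max(compute_s, memory_s) * 1e3, bound

# Prefill: 512 prompt tokens share a single read of the weights.
prefill_ms, prefill_bound = step_time_ms(n_tokens=512)

# Decode: one token per pass, plus a read of the (hypothetical) KV cache.
decode_ms, decode_bound = step_time_ms(n_tokens=1, kv_cache_bytes=0.25e9)

print(f"prefill: {prefill_ms:.2f} ms for 512 tokens ({prefill_bound}-bound)")
print(f"decode:  {decode_ms:.2f} ms for 1 token ({decode_bound}-bound)")
```

Under these assumptions the decode pass costs about as much wall-clock time as reading the weights once, no matter how little arithmetic it performs. The model ignores kernel launch overhead, attention FLOPs, and imperfect hardware utilization, so it indicates which resource is the bottleneck rather than predicting real latencies.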

Prompt length: 512 tokens

Prefill (all 512 tokens at once): GPU util 85%, Mem BW 30%, time 10 ms. Compute-bound.

Decode (one by one, 1 token per forward pass): GPU util 15%, Mem BW 90%, 25 ms per token. Memory-bandwidth-bound.

The GPU is a supercomputer that spends most of decode waiting for memory reads. This is why memory bandwidth (GB/s) matters more than FLOPS for token generation, and why batching multiple requests together helps: each read of the model weights is shared by every request in the batch, so its cost is amortized.
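The amortization effect of batching can be estimated with the same kind of rough model. In this sketch, one decode step reads the weights once and produces one token per request in the batch, so throughput grows almost linearly with batch size until compute becomes the bottleneck. The model size, precision, and hardware figures are hypothetical, and the estimate deliberately ignores KV cache reads, which grow with batch size because every request has its own cache.

```python
def decode_throughput(batch_size, params=7e9, bytes_per_param=2,
                      peak_flops=300e12, mem_bandwidth=1000e9):
    """Estimated decode tokens/second for a batch (hypothetical numbers)."""
    # One read of the weights serves the whole batch.
    weight_read_s = params * bytes_per_param / mem_bandwidth
    # Compute grows with the number of tokens produced in the step.
    compute_s = 2 * params * batch_size / peak_flops
    step_s = max(weight_read_s, compute_s)
    return batch_size / step_s

for batch_size in (1, 8, 64, 512):
    tokens_per_s = decode_throughput(batch_size)
    print(f"batch {batch_size:>3}: ~{tokens_per_s:,.0f} tokens/s")
```

With these assumed figures the step time stays fixed at the weight-read time for small batches, so going from batch 1 to batch 8 yields roughly 8 times the throughput. Past the point where compute time exceeds the weight-read time, larger batches stop helping in this model.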

Continue learning

Inside the Transformer

Understand the attention mechanism and token generation before diving into serving optimizations.

Quantization

INT8, GPTQ, and AWQ — how model weights are compressed to fit in memory.

Training & Fine-tuning

How KV cache and memory constraints shape training decisions and batch sizes.
