From FP16 to INT4: how to compress LLM weights 2-4x with minimal quality loss. Methods, tradeoffs, and practical deployment.
01The memory wall
Why Quantize?
A 70B parameter model in FP16 needs 140 GB of GPU memory just for the weights - more than any single consumer GPU has. Quantization compresses weights to 8-bit or 4-bit, cutting memory 2-4x with surprisingly little quality loss.
The key insight: most neural network weights cluster near zero and don't need 16 bits of precision. If you can represent them with 4-8 bits, you can fit models on smaller GPUs, increase batch sizes, and reduce inference cost. The challenge is doing this without destroying model quality.