Quantization

Making models fit - 6 chapters

From FP16 to INT4: how to compress LLM weights 2-4x with minimal quality loss. Methods, tradeoffs, and practical deployment.

The memory wall

Why Quantize?

A 70B-parameter model in FP16 needs 140 GB of GPU memory for the weights alone (2 bytes per parameter), more than any single consumer GPU provides. Quantization compresses the weights to 8-bit or 4-bit, cutting that footprint 2x or 4x respectively, typically with little quality loss.
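The arithmetic behind these figures is simple enough to check by hand. The sketch below is illustrative only: the helper name weight_memory_gb is made up for this example, 1 GB is taken as 1e9 bytes, and it counts weights only (no activations, KV cache, or the per-group scales that real quantized formats store alongside the weights).

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (assuming 1 GB = 1e9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


PARAMS_70B = 70e9  # 70 billion parameters

# Weight footprint at each precision discussed in the text.
sizes = {
    name: weight_memory_gb(PARAMS_70B, bits)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gb in sizes.items():
    print(f"{name}: {gb:.0f} GB")
```

Because the footprint scales linearly with bits per weight, halving the bit width halves the memory, which is where the 2x and 4x figures come from.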

The key insight: most neural network weights cluster near zero and don't need 16 bits of precision. If you can represent them with 4-8 bits, you can fit models on smaller GPUs, increase batch sizes, and reduce inference cost. The challenge is doing this without destroying model quality.
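To make the idea concrete, here is a minimal sketch of one of the simplest schemes, symmetric "absmax" round-to-nearest quantization with a single scale for the whole tensor. This is an assumption chosen for illustration, not necessarily the method any particular library uses. The function names and the toy Gaussian weights are hypothetical.

```python
import random


def quantize_absmax(weights, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor.

    Assumes at least one weight is non-zero (otherwise the scale would be 0).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    # Map each weight to the nearest integer step, clamped to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from the integer codes."""
    return [v * scale for v in q]


random.seed(0)
# Toy stand-in for real weights: values clustered near zero.
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: scale={scale:.6f}, max abs error={max_err:.6f}")
```

With round-to-nearest, the error on any single weight is at most half a quantization step (scale / 2). Fewer bits mean a larger step, which is the quality tradeoff the rest of this guide is about.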

The Memory Wall

Memory Required (70B model)

FP16: 140 GB
INT8: 70 GB
INT4: 35 GB

GPU Compatibility (number of GPUs needed to hold the 70B weights)

A100 80GB: FP16 2 GPUs, INT8 1 GPU, INT4 1 GPU
A100 40GB: FP16 4 GPUs, INT8 2 GPUs, INT4 1 GPU
RTX 4090 24GB: FP16 6 GPUs, INT8 3 GPUs, INT4 2 GPUs
RTX 3090 24GB: FP16 6 GPUs, INT8 3 GPUs, INT4 2 GPUs
RTX 4080 16GB: FP16 9 GPUs, INT8 5 GPUs, INT4 3 GPUs
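The GPU counts above are consistent with dividing the weight footprint by each card's memory and rounding up. The sketch below reproduces them under that assumption. It is a lower bound rather than a deployment plan: it ignores activations, KV cache, and framework overhead, and assumes the weights can be split evenly across cards. The name gpus_needed is made up for this example.

```python
import math

# Weight footprint of a 70B model at each precision, in GB (from the figures listed earlier).
MODEL_GB = {"FP16": 140, "INT8": 70, "INT4": 35}

# Memory per card, in GB.
GPU_GB = {
    "A100 80GB": 80,
    "A100 40GB": 40,
    "RTX 4090 24GB": 24,
    "RTX 3090 24GB": 24,
    "RTX 4080 16GB": 16,
}


def gpus_needed(model_gb: float, gpu_gb: float) -> int:
    """Smallest number of cards whose combined memory holds the weights."""
    return math.ceil(model_gb / gpu_gb)


table = {
    gpu: {fmt: gpus_needed(size, mem) for fmt, size in MODEL_GB.items()}
    for gpu, mem in GPU_GB.items()
}

for gpu, row in table.items():
    print(gpu, row)
```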

Estimated Cost ($/hour, total for the GPUs required)

A100 80GB: FP16 $4.42/hr, INT4 $2.21/hr
A100 40GB: FP16 $4.40/hr, INT4 $1.10/hr
RTX 4090 24GB: FP16 $4.44/hr, INT4 $1.48/hr
RTX 3090 24GB: FP16 $2.64/hr, INT4 $0.88/hr

FP16 Size: 140 GB
INT4 Size: 35 GB
Savings: 75%
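The cost figures appear to be the number of GPUs required multiplied by a per-card hourly rate. The rates in the sketch below are not stated in the text. They are inferred by dividing each listed total by its GPU count, so treat them as assumptions. The helper hourly_cost is likewise hypothetical.

```python
import math

# Per-card hourly rates in USD, inferred from the listed totals (assumption).
RATE_PER_GPU = {
    "A100 80GB": 2.21,
    "A100 40GB": 1.10,
    "RTX 4090 24GB": 0.74,
    "RTX 3090 24GB": 0.44,
}
GPU_GB = {"A100 80GB": 80, "A100 40GB": 40, "RTX 4090 24GB": 24, "RTX 3090 24GB": 24}
MODEL_GB = {"FP16": 140, "INT4": 35}  # 70B model, weights only


def hourly_cost(gpu: str, fmt: str) -> float:
    """Total $/hour for enough cards of one type to hold the weights."""
    n_gpus = math.ceil(MODEL_GB[fmt] / GPU_GB[gpu])
    return round(n_gpus * RATE_PER_GPU[gpu], 2)


for gpu in RATE_PER_GPU:
    print(f"{gpu}: FP16 ${hourly_cost(gpu, 'FP16'):.2f}/hr, INT4 ${hourly_cost(gpu, 'INT4'):.2f}/hr")

# Memory savings from FP16 to INT4, as a fraction of the FP16 footprint.
savings = 1 - MODEL_GB["INT4"] / MODEL_GB["FP16"]
print(f"Savings: {savings:.0%}")
```

Note that the 75% figure is the memory saving. The cost saving depends on how the footprint rounds up to whole cards, so it varies by GPU type.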
