Quantization

Making models fit - 6 chapters

From FP16 to INT4: how to compress LLM weights 2-4x with minimal quality loss. Methods, tradeoffs, and practical deployment.

01
The memory wall

Why Quantize?

A 70B-parameter model in FP16 needs 140 GB of GPU memory just for the weights (70 billion parameters at 2 bytes each) - far more than the 24 GB on a high-end consumer GPU such as the RTX 4090. Quantization compresses weights to 8-bit or 4-bit, cutting weight memory 2-4x with surprisingly little quality loss.
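The 140 GB figure follows from simple arithmetic: parameter count times bytes per parameter. A minimal sketch, assuming decimal gigabytes (1 GB = 1e9 bytes) and counting only the raw weights; the helper name `weight_memory_gb` is made up for this example, and real quantized checkpoints carry a little extra for scales and other metadata.

```python
# Bytes used to store one weight in each format.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes).

    Ignores activations, the KV cache, and quantization metadata.
    """
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_memory_gb(70e9, fmt):.0f} GB")
```

For a 70B model this gives 140 GB, 70 GB, and 35 GB, matching the figures used throughout this chapter.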

The key insight: most neural network weights cluster near zero and don't need 16 bits of precision. If you can represent them with 4-8 bits, you can fit models on smaller GPUs, increase batch sizes, and reduce inference cost. The challenge is doing this without destroying model quality.
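To make the idea concrete, here is a minimal sketch of one simple scheme, symmetric "absmax" quantization, applied to synthetic weights. It is an illustration under stated assumptions, not the method any particular library uses: the Gaussian weights (standard deviation 0.02) and the function names are invented for the example, and a single scale is shared by the whole tensor.

```python
import random

def quantize_absmax(weights, bits):
    """Symmetric quantization: map floats to integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    # One scale for the whole tensor; fall back to 1.0 if all weights are zero.
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]

# Assumed toy data: weights clustered near zero.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    mean_err = sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)
    print(f"INT{bits}: scale={scale:.6f}, mean abs error={mean_err:.6f}")
```

Each weight is stored as a small integer plus one shared scale, and the rounding error per weight is at most half the scale. With fewer bits the scale grows, so the error grows with it, which is the quality tradeoff described above.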

The Memory Wall

Memory Required (70B model)
FP16: 140 GB
INT8: 70 GB
INT4: 35 GB

GPU Compatibility (number of GPUs needed to hold the weights)
A100 80GB: FP16 2x, INT8 1x, INT4 1x
A100 40GB: FP16 4x, INT8 2x, INT4 1x
RTX 4090 24GB: FP16 6x, INT8 3x, INT4 2x
RTX 3090 24GB: FP16 6x, INT8 3x, INT4 2x
RTX 4080 16GB: FP16 9x, INT8 5x, INT4 3x

Estimated Cost ($/hour, total for the GPUs required)
A100 80GB: FP16 $4.42/hr, INT4 $2.21/hr
A100 40GB: FP16 $4.40/hr, INT4 $1.10/hr
RTX 4090 24GB: FP16 $4.44/hr, INT4 $1.48/hr
RTX 3090 24GB: FP16 $2.64/hr, INT4 $0.88/hr

FP16 Size: 140 GB
INT4 Size: 35 GB
Savings: 75%
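The GPU counts and hourly costs listed above are consistent with a simple rule: divide the weight size by the card's memory, round up, and multiply by a per-GPU price. The sketch below reproduces them under that assumption. The per-GPU prices are not quoted rates; they are back-derived from the totals above (for example, $4.42/hr for two A100 80GB cards implies $2.21/hr each), and `gpus_needed` is a hypothetical helper.

```python
import math

# Weight sizes for a 70B model, taken from the figures above.
WEIGHTS_GB = {"FP16": 140, "INT8": 70, "INT4": 35}

# (memory in GB, assumed $/hr per GPU). Prices are inferred from the
# totals above, not taken from any provider's price list.
GPUS = {
    "A100 80GB": (80, 2.21),
    "A100 40GB": (40, 1.10),
    "RTX 4090 24GB": (24, 0.74),
    "RTX 3090 24GB": (24, 0.44),
}

def gpus_needed(weights_gb: float, vram_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights."""
    return math.ceil(weights_gb / vram_gb)

for name, (vram, price) in GPUS.items():
    for fmt in ("FP16", "INT4"):
        n = gpus_needed(WEIGHTS_GB[fmt], vram)
        print(f"{name} {fmt}: {n} GPU(s), ${n * price:.2f}/hr")
```

These counts are lower bounds: they cover the weights only, so a real deployment also needs headroom for activations and the KV cache.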

forwardpass.dev

An interactive educational project visualizing how LLM inference, training, and deployment work - from raw text to generated response.
