Training & Fine-tuning
How models learn - 6 chapters
From pre-training on trillions of tokens to RLHF, GRPO, DPO, and evaluation - how raw neural networks become helpful assistants.
Pre-training
Pre-training is where a model learns language itself. Fed trillions of tokens from books, websites, and code, the model learns to predict the next token - and in doing so, acquires grammar, facts, reasoning patterns, and world knowledge.
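To make the next-token objective concrete, here is a minimal sketch that uses a toy bigram counter in place of a neural network. The corpus, the add-one smoothing, and the function names are all illustrative assumptions. Real pre-training optimizes the same kind of loss (average negative log-likelihood of the true next token) with a transformer and gradient descent.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev, vocab):
    """Add-one smoothed distribution over the next token, given the previous one."""
    total = sum(counts[prev].values()) + len(vocab)
    return {tok: (counts[prev][tok] + 1) / total for tok in vocab}

def average_nll(counts, tokens, vocab):
    """Mean negative log-likelihood of each actual next token (cross-entropy)."""
    losses = [
        -math.log(next_token_probs(counts, prev, vocab)[nxt])
        for prev, nxt in zip(tokens, tokens[1:])
    ]
    return sum(losses) / len(losses)

# Toy corpus (illustrative only); real corpora contain trillions of tokens.
corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))

model = train_bigram(corpus)
uniform_loss = math.log(len(vocab))          # loss of guessing uniformly at random
trained_loss = average_nll(model, corpus, vocab)

print(f"uniform guess loss: {uniform_loss:.3f}")
print(f"bigram model loss:  {trained_loss:.3f}")
```

The trained loss comes out lower than the uniform baseline because the counts capture regularities such as "cat" tending to follow "the". Lowering this loss at scale is what forces a model to absorb grammar and facts.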
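Training a frontier model costs tens of millions of dollars in compute. The data pipeline is critical: deduplication, quality filtering, toxicity removal, and domain mixing all shape what the model learns. Scaling laws such as Chinchilla estimate the compute-optimal ratio of training tokens to parameters, roughly 20 tokens per parameter. In practice, models are often trained well past that point so that a smaller model reaches higher quality and is cheaper to serve: a 70B model might train on 15 trillion tokens across thousands of GPUs for months.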
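The arithmetic behind these numbers can be sketched in a few lines. Three things here are assumptions rather than facts from the text: the rule of thumb of about 20 training tokens per parameter (a common reading of the Chinchilla results), the approximation that training costs about 6 FLOPs per parameter per token, and the cluster size and sustained per-GPU throughput, which are hypothetical values chosen only for illustration.

```python
TOKENS_PER_PARAM = 20      # assumed rule of thumb from the Chinchilla results
FLOPS_PER_PARAM_TOKEN = 6  # common approximation: forward + backward pass

def compute_optimal_tokens(n_params):
    """Approximate compute-optimal token count for a given model size."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def training_days(flops, n_gpus, flops_per_gpu_per_s):
    """Wall-clock days, given a sustained per-GPU throughput."""
    seconds = flops / (n_gpus * flops_per_gpu_per_s)
    return seconds / 86_400

n_params = 70e9   # 70B parameters, as in the text
n_tokens = 15e12  # 15 trillion tokens, as in the text

optimal = compute_optimal_tokens(n_params)
flops = training_flops(n_params, n_tokens)

# Hypothetical cluster: 4096 GPUs, each sustaining 4e14 FLOP/s.
days = training_days(flops, n_gpus=4096, flops_per_gpu_per_s=4e14)

print(f"compute-optimal tokens: {optimal:.2e}")
print(f"actual / optimal ratio: {n_tokens / optimal:.1f}x")
print(f"total training FLOPs:   {flops:.2e}")
print(f"estimated days:         {days:.0f}")
```

Under these assumptions, 15 trillion tokens is roughly ten times the compute-optimal budget for a 70B model. The day count is very sensitive to the assumed throughput, and real runs also lose time to restarts, evaluation, and hardware failures.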