Tokenizer Training

Building the vocabulary - 6 chapters

How BPE builds a vocabulary from raw bytes. Merge rules, byte-level encoding, vocabulary design, and why tokenizer choices shape everything downstream.

The vocabulary problem

Why Build a Tokenizer?

A language model can't process raw text; it needs a fixed vocabulary of discrete tokens. The tokenizer is trained before the model, and its choices permanently shape everything downstream: what the model can represent efficiently, how long sequences become, and which languages work well.

A bad tokenizer makes the model waste capacity. If a word like 'unhappiness' is one token in English but its equivalent takes 6 tokens in another language, the model needs 6x as many positions for the same concept, and more than 6x the attention computation, since standard self-attention scales with the square of the sequence length. GPT-2's tokenizer was trained mostly on English web text, which is why it tokenizes non-English text and code inefficiently. The tokenizers for LLaMA 3 and Qwen 2.5 were trained on multilingual data and use vocabularies of roughly 128K and 152K tokens respectively, which handle many languages well.
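To put rough numbers on that cost, here is a minimal sketch. It assumes standard full self-attention, where every position is scored against every other position, so the number of attention scores grows with the square of the token count. The function name and the token counts are hypothetical, chosen only to mirror the 6x example above.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key score computations for one head in one layer of full
    self-attention: each position is scored against every position."""
    return num_tokens * num_tokens


compact = 6             # hypothetical: a phrase that takes 6 tokens
fragmented = 6 * compact  # hypothetical: the same phrase at 6x the token count

ratio_positions = fragmented / compact
ratio_attention = attention_pairs(fragmented) / attention_pairs(compact)

print(f"positions: {ratio_positions:.0f}x, attention scores: {ratio_attention:.0f}x")
# prints "positions: 6x, attention scores: 36x"
```

The per-token layers (embeddings, feed-forward blocks) grow linearly with the token count; only the attention score computation grows quadratically under this assumption.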

The Vocabulary Problem

Same meaning - "The cat sat on the mat" - across languages

English: The cat sat on the mat

Spanish: El gato se sentó en la alfombra

Japanese: 猫がマットの上に座った

Arabic: جلست القطة على الحصيرة

Korean: 고양이가 매트 위에 앉았다

Chinese: 猫坐在垫子上

Token Count Comparison

English - "The cat sat on the mat": 6 tokens

Japanese - same meaning: 18 tokens

More tokens = more compute = slower inference. GPT-2's English-biased vocabulary fragments non-Latin scripts into many small tokens.
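Part of this gap is visible before any tokenizer is involved. A byte-level tokenizer starts from UTF-8 bytes, and non-Latin scripts need more bytes per character than ASCII text, so they rely more heavily on learned merges to stay compact. The sketch below uses only the Python standard library; the helper name is an assumption for illustration. The byte count it reports is what a byte-level tokenizer would produce with no merges applied at all, not GPT-2's actual token count.

```python
sentences = {
    "English": "The cat sat on the mat",
    "Spanish": "El gato se sentó en la alfombra",
    "Japanese": "猫がマットの上に座った",
    "Arabic": "جلست القطة على الحصيرة",
    "Korean": "고양이가 매트 위에 앉았다",
    "Chinese": "猫坐在垫子上",
}


def utf8_byte_count(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of text (hypothetical helper)."""
    return len(text.encode("utf-8"))


for language, text in sentences.items():
    chars = len(text)                # Unicode code points
    n_bytes = utf8_byte_count(text)  # starting units for byte-level BPE
    print(f"{language:<9} chars={chars:>3} bytes={n_bytes:>3} "
          f"bytes/char={n_bytes / chars:.2f}")
```

English comes out at one byte per character, while the Japanese and Chinese sentences use three bytes per character. A vocabulary trained mostly on English text has few merges that recombine those bytes, which is consistent with the fragmentation described above.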
