How BPE builds a vocabulary from raw bytes. Merge rules, byte-level encoding, vocabulary design, and why tokenizer choices shape everything downstream.
01 The vocabulary problem
Why Build a Tokenizer?
A language model can't process raw text - it needs a fixed vocabulary of discrete tokens. The tokenizer is trained BEFORE the model, and its choices permanently shape everything downstream: what the model can represent efficiently, how long sequences become, and which languages work well.
A bad tokenizer makes your model waste capacity. If a word like 'unhappiness' is a single token in English but its equivalent in another language is split into 6 tokens, the model needs 6x as many positions for the same concept, and attention computation grows even faster because it scales quadratically with sequence length. GPT-2's tokenizer was trained mostly on English web text, which is why it tokenizes non-English text and code inefficiently. The LLaMA 3 and Qwen 2.5 tokenizers were trained on multilingual data, with vocabularies of roughly 128K and 152K tokens respectively, and they handle many languages well.
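To make the cost of the 1-token versus 6-token case concrete, here is a minimal sketch. It assumes standard full self-attention, where the number of pairwise scores grows with the square of the sequence length; the function name `relative_cost` is purely illustrative and not part of any library.

```python
def relative_cost(tokens_a: int, tokens_b: int) -> dict:
    """Compare how expensive text B is relative to text A, given token counts.

    Assumption: full self-attention, so pairwise score count scales as n^2.
    Causal masking changes the constant factor, not the scaling.
    """
    ratio = tokens_b / tokens_a
    return {
        "positions": ratio,              # linear: context slots used
        "attention_pairs": ratio ** 2,   # quadratic: token-to-token comparisons
    }

# One token in English vs. six tokens for the same concept elsewhere.
cost = relative_cost(1, 6)
print(cost)  # {'positions': 6.0, 'attention_pairs': 36.0}
```

The quadratic figure applies when a whole text is tokenized this inefficiently; a single fragmented word inside an otherwise compact sequence costs far less than that.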
The Vocabulary Problem
Same meaning - "The cat sat on the mat" - across languages
Language | Sentence | Tokens
English | The cat sat on the mat | 6
Spanish | El gato se sentó en la alfombra | 10
Japanese | 猫がマットの上に座った | 18
Arabic | جلست القطة على الحصيرة | 22
Korean | 고양이가 매트 위에 앉았다 | 16
Chinese | 猫坐在垫子上 | 12
Token Count Comparison
English - "The cat sat on the mat": 6 tokens
Japanese - same meaning: 18 tokens
More tokens = more compute = slower inference. GPT-2's English-biased vocabulary fragments non-Latin scripts into many small tokens.
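One reason for this fragmentation is visible before any merges are learned. A byte-level tokenizer such as GPT-2's starts from UTF-8 bytes, and non-Latin characters occupy several bytes each. The sketch below uses only the Python standard library and measures that starting point for the example sentences. It does not run a real tokenizer, so it shows the raw byte counts that merges would need to compress, not actual token counts; `byte_stats` is an illustrative helper name.

```python
# Sentences taken from the comparison above.
sentences = {
    "English": "The cat sat on the mat",
    "Spanish": "El gato se sentó en la alfombra",
    "Japanese": "猫がマットの上に座った",
    "Chinese": "猫坐在垫子上",
}

def byte_stats(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

for lang, text in sentences.items():
    chars, nbytes = byte_stats(text)
    # bytes/char is 1.0 for pure ASCII and 3.0 for the CJK and kana used here
    print(f"{lang:9s} chars={chars:2d} bytes={nbytes:2d} bytes/char={nbytes / chars:.2f}")
```

ASCII text starts at one byte per character, while the Japanese and Chinese sentences start at three. If the training corpus contains few examples of those byte sequences, few merges are learned for them, and the text stays split into many small tokens.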