
Embeddings

From lookup tables to semantic search — 7 chapters

How token IDs become vectors, why similar words cluster in space, how TF-IDF, Word2Vec, and transformer models learn representations, and how sentence embeddings power semantic search and RAG.

01
From token ID to dense vector

The Lookup Table

Every token in the vocabulary maps to one row of the embedding matrix. When the model sees token ID 5432, it reads that row: a dense vector of d_model floats, typically 768 to 8192 depending on the model. The lookup is a pure index operation, with no multiplication and no activation function.

The embedding matrix has shape [vocab_size × d_model]. For GPT-2 (small): 50,257 × 768 ≈ 38.6M parameters in this one table. In PyTorch it is created with torch.nn.Embedding(vocab_size, d_model). The layer takes integer indices rather than one-hot vectors, which saves memory and compute: the result equals multiplying a one-hot vector by the weight matrix (a bias-free linear layer), but the matmul is never performed. d_model ranges from 768 (GPT-2 small) to 8192 (LLaMA 3 70B). The weights start random and are learned end-to-end.
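The equivalence between indexing and a one-hot matmul can be checked with a minimal sketch in plain Python. The toy table size (10 × 4), the random seed, and the helper names make_embedding_table, lookup, and one_hot_matmul are assumptions made for illustration; they are not part of PyTorch or any other library.

```python
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """Build a vocab_size x d_model table of small random floats (one row per token)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]

def lookup(table, token_id):
    """Embedding lookup: plain row indexing, no arithmetic at all."""
    return table[token_id]

def one_hot_matmul(table, token_id):
    """Same vector computed the slow way: one-hot row vector times the table."""
    vocab_size = len(table)
    d_model = len(table[0])
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [
        sum(one_hot[i] * table[i][j] for i in range(vocab_size))
        for j in range(d_model)
    ]

# Toy sizes (hypothetical); a real model would use e.g. 50,257 x 768.
table = make_embedding_table(vocab_size=10, d_model=4)
vec_fast = lookup(table, 7)          # reads one row
vec_slow = one_hot_matmul(table, 7)  # touches every row, then discards all but one
```

Both paths return the same vector, but the indexed version reads d_model numbers while the one-hot version does vocab_size × d_model multiplications, which is why frameworks implement embeddings as a lookup.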

Embedding matrix size: vocab size (50,257) × d_model (768) = 38.6M parameters

The embedding table is one of the largest single weight matrices in the model, even though using it costs only one row read per token.
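The arithmetic behind the 38.6M figure can be reproduced with a short sketch. The function names and the float32 storage assumption (4 bytes per weight) are illustrative choices, not something the source specifies.

```python
def embedding_param_count(vocab_size: int, d_model: int) -> int:
    """Number of weights in a [vocab_size x d_model] embedding matrix."""
    return vocab_size * d_model

def size_in_mib(param_count: int, bytes_per_param: int = 4) -> float:
    """Storage in MiB, assuming bytes_per_param bytes per weight (4 for float32)."""
    return param_count * bytes_per_param / 2**20

# GPT-2 figures quoted in the text above.
gpt2_params = embedding_param_count(50_257, 768)
gpt2_mib = size_in_mib(gpt2_params)

print(f"{gpt2_params:,} parameters")     # 38,597,376 parameters
print(f"{gpt2_mib:.1f} MiB in float32")  # 147.2 MiB in float32
```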

Example vocabulary entries
ID      token
0       [PAD]
1       [BOS]
2       [EOS]
5432    cat
5433    dog
5434    bank
5435    river
5436    money

The embedding layer is just a weight matrix used as a lookup table. The weights are learned through backpropagation — the model discovers which directions in this space encode useful features.
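How backpropagation reaches a lookup table can be sketched in a few lines of plain Python. This assumes plain SGD, no weight decay, and no weight tying with the output layer; the function names, the toy sizes, and the upstream gradient values are all made up for illustration. The point is that only rows whose tokens appeared in the batch receive a gradient, and a token that appears twice accumulates both contributions.

```python
def embedding_backward(vocab_size, d_model, token_ids, grad_outputs):
    """Scatter-add upstream gradients into a table-shaped gradient.

    grad_outputs[k] is the gradient of the loss with respect to the vector
    that was looked up for token_ids[k].
    """
    grad_table = [[0.0] * d_model for _ in range(vocab_size)]
    for token_id, grad_vec in zip(token_ids, grad_outputs):
        row = grad_table[token_id]
        for j, g in enumerate(grad_vec):
            row[j] += g  # repeated tokens accumulate
    return grad_table

def sgd_step(table, grad_table, lr):
    """In-place plain SGD update: w <- w - lr * grad."""
    for row, grad_row in zip(table, grad_table):
        for j, g in enumerate(grad_row):
            row[j] -= lr * g

# Toy setup (hypothetical): 4 tokens, 2 dimensions, every weight starts at 1.0.
table = [[1.0, 1.0] for _ in range(4)]
token_ids = [1, 3, 1]                                # token 1 appears twice
grad_outputs = [[1.0, 0.0], [0.5, 0.5], [1.0, 2.0]]  # made-up upstream gradients

grad_table = embedding_backward(4, 2, token_ids, grad_outputs)
sgd_step(table, grad_table, lr=0.1)
```

After the step, rows 0 and 2 are unchanged because their tokens never appeared, while rows 1 and 3 have moved. Repeated over many batches, this is how each row drifts toward a position that is useful for the tokens it represents.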


forwardpass.dev

An interactive educational project visualizing how LLM inference, training, and deployment work - from raw text to generated response.
