
Tokenization & BPE

Every language model begins by converting raw text into a sequence of integers — the tokenization step. The vocabulary and algorithm used here directly constrain what the model can express: a vocabulary that handles "running" and "runs" as separate tokens misses morphological structure, while one that decomposes every word into individual characters produces sequences too long for transformers to model efficiently. This lesson derives the Byte Pair Encoding algorithm, explains its variants, and builds intuition for how vocabulary design shapes model behavior.

Theory

[Interactive figure: BPE Algorithm — Merge Steps. Corpus: "low lower newest widest", split into characters with end-of-word markers (l o w </w> · l o w e r </w> · n e w e s t </w> · w i d e s t </w>). Adjacent pair counts ("e s" ×2, "l o" ×2, "o w" ×2, "s t" ×2, "e r" ×1, …) drive the widget through 4 successive merges.]

A tokenizer decides what the model considers a "word." Start with individual characters; ask: which pairs of symbols appear so often together that they should be treated as one? Repeat until you reach your target vocabulary size. That greedy merging process is BPE — the diagram above shows how the merge sequence progressively builds up subword units like "est" and "low" from raw characters. The model never sees out-of-vocabulary words because any string can always be decomposed back to its base symbols — characters here, or raw bytes in byte-level variants.

The Tokenization Problem

A tokenizer is a mapping \tau: \Sigma^* \to \mathbb{Z}^* from strings over alphabet \Sigma to integer sequences. Three fundamental strategies trade off vocabulary size against sequence length:

Character tokenization: \Sigma is the character set. Every string decomposes uniquely. Vocabulary is tiny (256–512 tokens), but sequences are very long — a 100-word document produces 500+ character tokens. The transformer's O(n^2) attention cost makes this expensive.

Word tokenization: Each unique word in the training corpus is a vocabulary entry. Vocabulary is large (50k–500k) and sequences are short, but unknown words at test time cannot be represented — the out-of-vocabulary (OOV) problem. Rare words receive sparse gradient updates and are poorly learned.

Subword tokenization: The dominant approach. Words decompose into frequent subword units learned from the corpus. "unbelievable" might become (un, believ, able). Vocabulary size (16k–100k) and sequence length (roughly 0.75 words per token for English) are jointly controlled by target vocabulary size V.

Byte Pair Encoding (BPE)

BPE, adapted from data compression to NLP by Sennrich et al. (2016), proceeds as follows:

  1. Initialize vocabulary with all unique characters in the corpus (or byte values 0–255 for byte-level BPE)
  2. Add end-of-word symbol </w> to each word
  3. Count all adjacent symbol pairs across the corpus
  4. Merge the most frequent pair everywhere, adding the merged unit to the vocabulary
  5. Repeat steps 3–4 until vocabulary size reaches V

The merge rule selects the pair (a, b) that maximizes count:

\text{merge}^* = \arg\max_{(a,b) \in \mathcal{P}} \text{count}(a, b)

where \mathcal{P} is the set of all adjacent pairs in the current segmentation. Ties are broken deterministically.
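The counting-and-selection step can be sketched in a few lines of Python. This is a toy illustration, not a production implementation: `corpus`, `count_pairs`, and `best_pair` are names of our own choosing, and the corpus is held as a dict from a word's current segmentation (a tuple of symbols) to its frequency.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def best_pair(pairs):
    """Most frequent pair; ties broken deterministically (lexicographically smallest)."""
    top = max(pairs.values())
    return min(p for p in pairs if pairs[p] == top)

# The four-word corpus from this lesson, split into characters + </w>.
corpus = {
    ("l", "o", "w", "</w>"): 1,
    ("l", "o", "w", "e", "r", "</w>"): 1,
    ("n", "e", "w", "e", "s", "t", "</w>"): 1,
    ("w", "i", "d", "e", "s", "t", "</w>"): 1,
}
pairs = count_pairs(corpus)
print(best_pair(pairs), pairs[best_pair(pairs)])  # ('e', 's') 2
```

Any deterministic tie-break is valid; different implementations may order tied merges differently, which is why a tokenizer's merge list must be shipped with the model.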

Greedy frequency-based merging is justified because word frequency distributions follow Zipf's law — a small number of token pairs account for most co-occurrences. Merging the most frequent pair first guarantees that the highest-information compressions happen early, so any fixed vocabulary budget is spent on the subword units that carry the most signal. Globally optimal merging (considering all possible merge sequences) is NP-hard; greedy BPE is a practical approximation that empirically matches the quality of optimal solutions at the vocabulary sizes used in practice.

At inference time, the learned merge operations are applied in training order — the encoding is greedy and deterministic: given the ordered merge list, a string has exactly one tokenization.

Complexity: Naive BPE is O(C \cdot M) for corpus size C and M merges. Implementations using priority queues achieve O(C \log V) amortized.

WordPiece (BERT)

WordPiece selects merges by maximizing corpus log-likelihood under a unigram LM, equivalent to maximizing pointwise mutual information between the pair:

\text{merge}^* = \arg\max_{(a,b)} \frac{\text{count}(ab)}{\text{count}(a) \cdot \text{count}(b)}

WordPiece prefers merges of symbols that frequently co-occur relative to their marginal frequencies. Subwords continuing a word are prefixed with ## (e.g., "##ing") to distinguish them from word-initial tokens.
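The only difference from BPE is the scoring function: a likelihood ratio instead of a raw count. A minimal sketch with illustrative toy counts of our own (not from any real corpus):

```python
from collections import Counter

def wordpiece_score(pair_count, a_count, b_count):
    """WordPiece merge score: co-occurrence relative to marginal frequencies."""
    return pair_count / (a_count * b_count)

# Toy counts: "e s" co-occurs as often as "o w", but "e" is far more common
# overall, so WordPiece prefers "o w" where plain BPE would see a tie.
symbol_counts = Counter({"e": 40, "s": 10, "o": 12, "w": 12})
pair_counts = Counter({("e", "s"): 8, ("o", "w"): 8})

scores = {p: wordpiece_score(c, symbol_counts[p[0]], symbol_counts[p[1]])
          for p, c in pair_counts.items()}
best = max(scores, key=scores.get)
print(best)  # ('o', 'w'): 8/(12*12) ≈ 0.056 beats 8/(40*10) = 0.02
```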

Unigram Language Model (SentencePiece)

The Unigram LM tokenizer (Kudo, 2018), used via SentencePiece in models such as T5, ALBERT, and XLNet, takes a different approach — it starts large and prunes:

  1. Initialize with all substrings up to length L plus individual characters
  2. Assign each entry a probability p(x) by fitting a unigram LM: maximize \sum_s \log P(s) where P(s) = \prod_{x \in \text{seg}(s)} p(x) for the most probable segmentation
  3. Remove the bottom p% of tokens contributing least to corpus likelihood (keeping all single characters)
  4. Repeat until target vocabulary size is reached

The Unigram LM assigns a probability distribution over all possible segmentations, enabling sampling during training — a useful regularization technique. The most probable segmentation is found by the Viterbi algorithm in O(n \cdot L) time.
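The Viterbi step can be sketched as follows. The vocabulary and probabilities here are toy values we chose by hand; a real SentencePiece model learns them with EM over the corpus.

```python
import math

def viterbi_segment(text, vocab):
    """Most probable segmentation of `text` under a unigram LM.

    `vocab` maps token -> probability; best[i] holds the best log-prob
    of segmenting the prefix text[:i], back[i] the split point reaching it.
    """
    max_len = max(map(len, vocab))
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in vocab and best[j] + math.log(vocab[piece]) > best[i]:
                best[i] = best[j] + math.log(vocab[piece])
                back[i] = j
    # Recover the tokens by walking the backpointers from the end.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"low": 0.1, "est": 0.08, "lo": 0.05, "w": 0.05,
         "l": 0.02, "o": 0.02, "e": 0.03, "s": 0.03, "t": 0.03}
print(viterbi_segment("lowest", vocab))  # ['low', 'est']
```

The product 0.1 × 0.08 beats every finer-grained split (e.g. "lo"·"w"·"est" = 0.0002), so the coarse segmentation wins.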

Tiktoken / cl100k_base (GPT-4)

OpenAI's tiktoken implements byte-level BPE. The base alphabet is the 256 possible byte values. This guarantees:

  • No unknown tokens: any byte sequence (arbitrary Unicode, code, binary-embedded text) has a valid tokenization
  • Consistent multilingual coverage: CJK, emoji, and rare scripts are all representable

The cl100k_base vocabulary (GPT-3.5-turbo, GPT-4) has 100,277 tokens. An English word typically becomes 1–3 tokens; a Chinese character 1–2 tokens.
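The completeness guarantee is easy to verify in pure Python: UTF-8 turns any string into bytes 0–255, all of which are in the base vocabulary. This sketch (the function name is ours) shows only the byte-level fallback; real tiktoken then applies learned BPE merges on top of these bytes.

```python
def byte_level_base_encode(text):
    """Fall back to raw UTF-8 bytes — the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))

for s in ["hello", "héllo", "日本語", "🌍"]:
    ids = byte_level_base_encode(s)
    assert all(0 <= b <= 255 for b in ids)  # every id is in the 256-symbol base
    print(f"{s!r}: {len(s)} chars -> {len(ids)} byte tokens")
```

Note the cost of completeness: non-ASCII scripts expand — each CJK character becomes 3 base bytes, an emoji 4 — which is exactly why learned merges over common multi-byte sequences matter for multilingual efficiency.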

Vocabulary Size Tradeoffs

                     Small V (1k–8k)              Large V (100k+)
Sequence length      Long (expensive attention)   Short (cheaper inference)
Embedding matrix     Small                        Large (V × d parameters)
Rare word handling   Splits to subwords           Sparse gradients for rare tokens
Morphology           Explicit in token sequence   Opaque — merged into one token

Fertility measures encoding efficiency for a language:

\text{fertility} = \frac{\text{number of tokens in corpus}}{\text{number of words in corpus}}

English with GPT-4's tokenizer has fertility ~1.3 (most words are one token). Languages less represented in the tokenizer's training data have higher fertility — a word in a low-resource language may decompose into 5+ byte-level tokens, meaning the model must capture longer-range dependencies to represent a single semantic unit.
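Given any tokenizer, fertility is a one-liner. Here `tokenize` is a hypothetical stand-in (it chops words into fixed-size character chunks) purely to exercise the formula; in practice you would pass a real tokenizer's encode function.

```python
def fertility(texts, tokenize):
    """Tokens produced per whitespace-delimited word, over a corpus."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Stand-in tokenizer: every word becomes ceil(len(word) / 4) chunks.
def chunk_tokenize(text, size=4):
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]

corpus = ["the quick brown fox", "tokenization matters"]
print(round(fertility(corpus, chunk_tokenize), 2))  # 11 tokens / 6 words ≈ 1.83
```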

Walkthrough

We run BPE manually on a four-word corpus: "low lower newest widest". Each word appears once. Target: add 4 merges.

Initial segmentation (characters, </w> marks end-of-word):

low    → l o w </w>
lower  → l o w e r </w>
newest → n e w e s t </w>
widest → w i d e s t </w>

Count adjacent pairs:

Pair   Count
e s    2 (newest, widest)
l o    2 (low, lower)
o w    2 (low, lower)
s t    2 (newest, widest)
w e    2 (lower, newest)
e r    1 (lower)

All other adjacent pairs appear exactly once.

Merge 1: e + s → es (count 2). Result in newest/widest:

newest → n e w es t </w>
widest → w i d es t </w>

Merge 2: es + t → est (count 2). The suffix "est" is now a single unit:

newest → n e w est </w>
widest → w i d est </w>

Merge 3: l + o → lo (count 2). Prefix of "low/lower" captured:

low   → lo w </w>
lower → lo w e r </w>

Merge 4: lo + w → low (count 2). The root "low" is now a single token:

low   → low </w>
lower → low e r </w>

Learned merges (in order): e+s, es+t, l+o, lo+w.

Tokenizing a new word "lowest" at inference time — apply merges in order:

  1. Start: l o w e s t </w>
  2. Apply e+s: l o w es t </w>
  3. Apply es+t: l o w est </w>
  4. Apply l+o: lo w est </w>
  5. Apply lo+w: low est </w>

Result: ["low", "est", "</w>"] — the model correctly identifies the root and suffix despite never seeing "lowest" during vocabulary construction.
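The inference-time procedure can be checked mechanically. A minimal sketch (function name ours) that applies the four learned merges, in training order, to a word's character sequence:

```python
def bpe_encode(word, merges):
    """Apply learned merges in training order (greedy and deterministic)."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)   # fuse the matched pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est', '</w>']
```

Unknown material degrades gracefully: symbols that participate in no merge simply remain as single characters.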

Analysis & Evaluation

Where Your Intuition Breaks

It is tempting to treat the tokenizer as a fixed preprocessing step that can be swapped out without affecting the model. In reality, the tokenizer and model are deeply coupled: the embedding matrix has exactly V rows (one per vocabulary entry), positional encodings are calibrated for typical sequence lengths under that tokenizer, and the model's internal representations encode token-level structure. Changing the tokenizer requires retraining the model from scratch. This coupling is why tokenization decisions — vocabulary size, merge count, byte-level vs character-level base — are made once during pre-training and rarely revisited. A "better" tokenizer does not improve an existing model; it only helps a new one.

Tokenization Strategy Comparison

Strategy         Vocab Size   OOV Handling    Multilingual   Sequence Length   Best For
Character        100–512      None            Good           Very long         Morphologically rich languages
Word             50k–500k     UNK token       Poor           Short             Legacy NLP systems
BPE              8k–100k      Subword split   Good           Medium            Most modern LLMs
WordPiece        8k–32k       Subword split   Good           Medium            BERT-family encoders
Unigram LM       8k–100k      Subword split   Very good      Medium            T5, multilingual models
Byte-level BPE   50k–256k     None (0% OOV)   Excellent      Medium–long       GPT-3/4, multilingual LLMs

Why Bytes-as-Base Eliminates Unknown Tokens

The key insight of byte-level BPE is that 256 byte values are complete: any string — arbitrary Unicode, code, binary-embedded content — has a valid tokenization. Traditional character-level vocabularies covering only training-seen characters will produce [UNK] for rare Unicode. Byte-level BPE trades slightly longer sequences for zero OOV rate.

Tokenization Artifacts and Failure Modes

Whitespace sensitivity: Most BPE tokenizers treat leading whitespace as part of the token. "dog" and " dog" (with leading space) have different embeddings. This causes subtle bugs when concatenating strings for model input.

Morphological opacity: "dog" and "dogs" are often different tokens. The model learns their relationship from co-occurrence statistics rather than explicit morphological structure. Languages with rich inflection are particularly affected.

Number tokenization: "1000000" may tokenize as ("100", "000", "0") or ("1", "000", "000"). This makes arithmetic and number comparison difficult for token-level models.

Capitalization: "GPU" and "gpu" are different tokens. For named entities this is meaningful signal — but the model must independently learn both forms.

Fertility disparity across languages: Since training often samples by token rather than by document, high-fertility languages receive fewer document-level gradient updates per step, contributing to multilingual performance gaps.

💡The compression view of tokenization

BPE is fundamentally a compression algorithm. Each merge replaces a two-symbol pattern with a single symbol, reducing total sequence length. A good vocabulary compresses the training corpus efficiently — frequent patterns get single tokens, rare patterns decompose. Vocabulary size is the compression ratio parameter: larger vocabulary means higher compression (shorter sequences) at the cost of more parameters in the embedding table and sparser gradient updates for rare tokens.
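The compression view can be made concrete: each merge shortens the corpus by exactly the merged pair's count. A toy trainer on the lesson's four-word corpus (with a deterministic tie-break of our own — lexicographically smallest among the most frequent — which may order tied merges differently from the walkthrough) shows the total symbol count falling with every merge:

```python
from collections import Counter

def train_bpe(corpus, n_merges):
    """Greedy BPE over {symbol-tuple: freq}; returns corpus length after each merge."""
    lengths = [sum(len(w) * f for w, f in corpus.items())]
    for _ in range(n_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for p in zip(word, word[1:]):
                pairs[p] += freq
        top = max(pairs.values())
        a, b = min(p for p in pairs if pairs[p] == top)  # deterministic tie-break
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = merged
        lengths.append(sum(len(w) * f for w, f in corpus.items()))
    return lengths

corpus = {("l", "o", "w", "</w>"): 1, ("l", "o", "w", "e", "r", "</w>"): 1,
          ("n", "e", "w", "e", "s", "t", "</w>"): 1, ("w", "i", "d", "e", "s", "t", "</w>"): 1}
print(train_bpe(corpus, 4))  # each merge has count 2, so length drops by 2 each step
```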

⚠️Never preprocess text before a byte-level tokenizer

Byte-level tokenizers (tiktoken, SentencePiece with byte fallback) handle arbitrary Unicode natively. Stripping punctuation, lowercasing, or removing HTML before passing text to these tokenizers alters the token sequence in ways that may not match the distribution the model was trained on. For LLM APIs, pass raw (or minimally cleaned) text and let the tokenizer handle it.
