
Tokenization & BPE

Every language model begins by converting raw text into a sequence of integers — the tokenization step. The vocabulary and algorithm used here directly constrain what the model can express: a vocabulary that handles "running" and "runs" as separate tokens misses morphological structure, while one that decomposes every word into individual characters produces sequences too long for transformers to model efficiently. This lesson derives the Byte Pair Encoding algorithm, explains its variants, and builds intuition for how vocabulary design shapes model behavior.

Theory

[Interactive figure: BPE Algorithm — Merge Steps. Corpus: "low lower newest widest", split into characters with end-of-word markers (l o w </w> · l o w e r </w> · n e w e s t </w> · w i d e s t </w>). Adjacent pair counts ("e s" ×2, "l o" ×2, "o w" ×2, "s t" ×2, "e r" ×1, …) drive the widget through 4 successive merges.]

A tokenizer decides what the model considers a "word." Start with individual characters; ask: which pairs of symbols appear so often together that they should be treated as one? Repeat until you reach your target vocabulary size. That greedy merging process is BPE — the diagram above shows how the merge sequence progressively builds up subword units like "est" and "low" from raw characters. The model never sees out-of-vocabulary words because any string can always be decomposed back to its base symbols — characters here, or raw bytes in byte-level variants.

The Tokenization Problem

A tokenizer is a mapping \tau: \Sigma^* \to \mathbb{Z}^* from strings over alphabet \Sigma to integer sequences. Three fundamental strategies trade off vocabulary size against sequence length:

Character tokenization: \Sigma is the character set. Every string decomposes uniquely. Vocabulary is tiny (256–512 tokens), but sequences are very long — a 100-word document produces 500+ character tokens. The transformer's O(n^2) attention cost makes this expensive.

Word tokenization: Each unique word in the training corpus is a vocabulary entry. Vocabulary is large (50k–500k) and sequences are short, but unknown words at test time cannot be represented — the out-of-vocabulary (OOV) problem. Rare words receive sparse gradient updates and are poorly learned.

Subword tokenization: The dominant approach. Words decompose into frequent subword units learned from the corpus. "unbelievable" might become (un, believ, able). Vocabulary size (16k–100k) and sequence length (roughly 0.75 words per token for English) are jointly controlled by target vocabulary size V.

Byte Pair Encoding (BPE)

BPE, adapted from data compression to NLP by Sennrich et al. (2016), proceeds as follows:

  1. Initialize vocabulary with all unique characters in the corpus (or byte values 0–255 for byte-level BPE)
  2. Add end-of-word symbol </w> to each word
  3. Count all adjacent symbol pairs across the corpus
  4. Merge the most frequent pair everywhere, adding the merged unit to the vocabulary
  5. Repeat steps 3–4 until vocabulary size reaches V

The merge rule selects the pair (a, b) that maximizes count:

\text{merge}^* = \arg\max_{(a,b) \in \mathcal{P}} \text{count}(a, b)

where \mathcal{P} is the set of all adjacent pairs in the current segmentation. Ties are broken deterministically.
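The counting-and-selection step can be sketched in a few lines of Python. This is a toy illustration, not a production implementation: `corpus`, `count_pairs`, and `best_pair` are names of our own choosing, and the corpus is held as a dict from a word's current segmentation (a tuple of symbols) to its frequency.

```python
from collections import Counter

def count_pairs(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def best_pair(pairs):
    """Most frequent pair; ties broken deterministically (lexicographically smallest)."""
    top = max(pairs.values())
    return min(p for p in pairs if pairs[p] == top)

# The four-word corpus from this lesson, split into characters + </w>.
corpus = {
    ("l", "o", "w", "</w>"): 1,
    ("l", "o", "w", "e", "r", "</w>"): 1,
    ("n", "e", "w", "e", "s", "t", "</w>"): 1,
    ("w", "i", "d", "e", "s", "t", "</w>"): 1,
}
pairs = count_pairs(corpus)
print(best_pair(pairs), pairs[best_pair(pairs)])  # ('e', 's') 2
```

Any deterministic tie-break is valid; different implementations may order tied merges differently, which is why a tokenizer's merge list must be shipped with the model.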

Greedy frequency-based merging is justified because word frequency distributions follow Zipf's law — a small number of token pairs account for most co-occurrences. Merging the most frequent pair first guarantees that the highest-information compressions happen early, so any fixed vocabulary budget is spent on the subword units that carry the most signal. Globally optimal merging (considering all possible merge sequences) is NP-hard; greedy BPE is a practical approximation that empirically matches the quality of optimal solutions at the vocabulary sizes used in practice.

At inference time, the learned merge operations are applied in training order — the encoding is greedy and deterministic: given the ordered merge list, a string has exactly one tokenization.

Complexity: Naive BPE is O(C \cdot M) for corpus size C and M merges. Implementations using priority queues achieve O(C \log V) amortized.

WordPiece (BERT)

WordPiece selects merges by maximizing corpus log-likelihood under a unigram LM, equivalent to maximizing pointwise mutual information between the pair:

\text{merge}^* = \arg\max_{(a,b)} \frac{\text{count}(ab)}{\text{count}(a) \cdot \text{count}(b)}

WordPiece prefers merges of symbols that frequently co-occur relative to their marginal frequencies. Subwords continuing a word are prefixed with ## (e.g., "##ing") to distinguish them from word-initial tokens.
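The only difference from BPE is the scoring function: a likelihood ratio instead of a raw count. A minimal sketch with illustrative toy counts of our own (not from any real corpus):

```python
from collections import Counter

def wordpiece_score(pair_count, a_count, b_count):
    """WordPiece merge score: co-occurrence relative to marginal frequencies."""
    return pair_count / (a_count * b_count)

# Toy counts: "e s" co-occurs as often as "o w", but "e" is far more common
# overall, so WordPiece prefers "o w" where plain BPE would see a tie.
symbol_counts = Counter({"e": 40, "s": 10, "o": 12, "w": 12})
pair_counts = Counter({("e", "s"): 8, ("o", "w"): 8})

scores = {p: wordpiece_score(c, symbol_counts[p[0]], symbol_counts[p[1]])
          for p, c in pair_counts.items()}
best = max(scores, key=scores.get)
print(best)  # ('o', 'w'): 8/(12*12) ≈ 0.056 beats 8/(40*10) = 0.02
```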

Unigram Language Model (SentencePiece)

The Unigram LM tokenizer (Kudo, 2018), used via SentencePiece in models such as T5, ALBERT, and XLNet, takes a different approach — it starts large and prunes:

  1. Initialize with all substrings up to length L plus individual characters
  2. Assign each entry a probability p(x) by fitting a unigram LM: maximize \sum_s \log P(s) where P(s) = \prod_{x \in \text{seg}(s)} p(x) for the most probable segmentation
  3. Remove the bottom p% of tokens contributing least to corpus likelihood (keeping all single characters)
  4. Repeat until target vocabulary size is reached

The Unigram LM assigns a probability distribution over all possible segmentations, enabling sampling during training — a useful regularization technique. The most probable segmentation is found by the Viterbi algorithm in O(n \cdot L) time.
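The Viterbi step can be sketched as follows. The vocabulary and probabilities here are toy values we chose by hand; a real SentencePiece model learns them with EM over the corpus.

```python
import math

def viterbi_segment(text, vocab):
    """Most probable segmentation of `text` under a unigram LM.

    `vocab` maps token -> probability; best[i] holds the best log-prob
    of segmenting the prefix text[:i], back[i] the split point reaching it.
    """
    max_len = max(map(len, vocab))
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in vocab and best[j] + math.log(vocab[piece]) > best[i]:
                best[i] = best[j] + math.log(vocab[piece])
                back[i] = j
    # Recover the tokens by walking the backpointers from the end.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"low": 0.1, "est": 0.08, "lo": 0.05, "w": 0.05,
         "l": 0.02, "o": 0.02, "e": 0.03, "s": 0.03, "t": 0.03}
print(viterbi_segment("lowest", vocab))  # ['low', 'est']
```

The product 0.1 × 0.08 beats every finer-grained split (e.g. "lo"·"w"·"est" = 0.0002), so the coarse segmentation wins.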

Tiktoken / cl100k_base (GPT-4)

OpenAI's tiktoken implements byte-level BPE. The base alphabet is the 256 possible byte values. This guarantees:

  • No unknown tokens: any byte sequence (arbitrary Unicode, code, binary-embedded text) has a valid tokenization
  • Consistent multilingual coverage: CJK, emoji, and rare scripts are all representable

The cl100k_base vocabulary (GPT-3.5-turbo, GPT-4) has 100,277 tokens. An English word typically becomes 1–3 tokens; a Chinese character 1–2 tokens.
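The completeness guarantee is easy to verify in pure Python: UTF-8 turns any string into bytes 0–255, all of which are in the base vocabulary. This sketch (the function name is ours) shows only the byte-level fallback; real tiktoken then applies learned BPE merges on top of these bytes.

```python
def byte_level_base_encode(text):
    """Fall back to raw UTF-8 bytes — the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))

for s in ["hello", "héllo", "日本語", "🌍"]:
    ids = byte_level_base_encode(s)
    assert all(0 <= b <= 255 for b in ids)  # every id is in the 256-symbol base
    print(f"{s!r}: {len(s)} chars -> {len(ids)} byte tokens")
```

Note the cost of completeness: non-ASCII scripts expand — each CJK character becomes 3 base bytes, an emoji 4 — which is exactly why learned merges over common multi-byte sequences matter for multilingual efficiency.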

Vocabulary Size Tradeoffs

                     Small V (1k–8k)              Large V (100k+)
Sequence length      Long (expensive attention)   Short (cheaper inference)
Embedding matrix     Small                        Large (V × d parameters)
Rare word handling   Splits to subwords           Sparse gradients for rare tokens
Morphology           Explicit in token sequence   Opaque — merged into one token

Fertility measures encoding efficiency for a language:

\text{fertility} = \frac{\text{number of tokens in corpus}}{\text{number of words in corpus}}

English with GPT-4's tokenizer has fertility ~1.3 (most words are one token). Languages less represented in the tokenizer's training data have higher fertility — a word in a low-resource language may decompose into 5+ byte-level tokens, meaning the model must capture longer-range dependencies to represent a single semantic unit.
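Given any tokenizer, fertility is a one-liner. Here `tokenize` is a hypothetical stand-in (it chops words into fixed-size character chunks) purely to exercise the formula; in practice you would pass a real tokenizer's encode function.

```python
def fertility(texts, tokenize):
    """Tokens produced per whitespace-delimited word, over a corpus."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Stand-in tokenizer: every word becomes ceil(len(word) / 4) chunks.
def chunk_tokenize(text, size=4):
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]

corpus = ["the quick brown fox", "tokenization matters"]
print(round(fertility(corpus, chunk_tokenize), 2))  # 11 tokens / 6 words ≈ 1.83
```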

Walkthrough

We run BPE manually on a four-word corpus: "low lower newest widest". Each word appears once. Target: add 4 merges.

Initial segmentation (characters, </w> marks end-of-word):

low    → l o w </w>
lower  → l o w e r </w>
newest → n e w e s t </w>
widest → w i d e s t </w>

Count adjacent pairs:

Pair   Count
e s    2 (newest, widest)
l o    2 (low, lower)
o w    2 (low, lower)
s t    2 (newest, widest)
w e    2 (lower, newest)
e r    1 (lower)

All other adjacent pairs appear exactly once.

Merge 1: e + s → es (count 2). Result in newest/widest:

newest → n e w es t </w>
widest → w i d es t </w>

Merge 2: es + t → est (count 2). The suffix "est" is now a single unit:

newest → n e w est </w>
widest → w i d est </w>

Merge 3: l + o → lo (count 2). Prefix of "low/lower" captured:

low   → lo w </w>
lower → lo w e r </w>

Merge 4: lo + w → low (count 2). The root "low" is now a single token:

low   → low </w>
lower → low e r </w>

Learned merges (in order): e+s, es+t, l+o, lo+w.

Tokenizing a new word "lowest" at inference time — apply merges in order:

  1. Start: l o w e s t </w>
  2. Apply e+s: l o w es t </w>
  3. Apply es+t: l o w est </w>
  4. Apply l+o: lo w est </w>
  5. Apply lo+w: low est </w>

Result: ["low", "est", "</w>"] — the model correctly identifies the root and suffix despite never seeing "lowest" during vocabulary construction.
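The inference-time procedure can be checked mechanically. A minimal sketch (function name ours) that applies the four learned merges, in training order, to a word's character sequence:

```python
def bpe_encode(word, merges):
    """Apply learned merges in training order (greedy and deterministic)."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)   # fuse the matched pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]
print(bpe_encode("lowest", merges))  # ['low', 'est', '</w>']
```

Unknown material degrades gracefully: symbols that participate in no merge simply remain as single characters.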

Analysis & Evaluation

Where Your Intuition Breaks

It is tempting to treat the tokenizer as a fixed preprocessing step that can be swapped out without affecting the model. In reality, the tokenizer and model are deeply coupled: the embedding matrix has exactly V rows (one per vocabulary entry), positional encodings are calibrated for typical sequence lengths under that tokenizer, and the model's internal representations encode token-level structure. Changing the tokenizer requires retraining the model from scratch. This coupling is why tokenization decisions — vocabulary size, merge count, byte-level vs character-level base — are made once during pre-training and rarely revisited. A "better" tokenizer does not improve an existing model; it only helps a new one.

Tokenization Strategy Comparison

Strategy         Vocab Size   OOV Handling    Multilingual   Sequence Length   Best For
Character        100–512      None            Good           Very long         Morphologically rich languages
Word             50k–500k     UNK token       Poor           Short             Legacy NLP systems
BPE              8k–100k      Subword split   Good           Medium            Most modern LLMs
WordPiece        8k–32k       Subword split   Good           Medium            BERT-family encoders
Unigram LM       8k–100k      Subword split   Very good      Medium            T5, multilingual models
Byte-level BPE   50k–256k     None (0% OOV)   Excellent      Medium–long       GPT-3/4, multilingual LLMs

Why Bytes-as-Base Eliminates Unknown Tokens

The key insight of byte-level BPE is that 256 byte values are complete: any string — arbitrary Unicode, code, binary-embedded content — has a valid tokenization. Traditional character-level vocabularies covering only training-seen characters will produce [UNK] for rare Unicode. Byte-level BPE trades slightly longer sequences for zero OOV rate.

Tokenization Artifacts and Failure Modes

Whitespace sensitivity: Most BPE tokenizers treat leading whitespace as part of the token. "dog" and " dog" (with leading space) have different embeddings. This causes subtle bugs when concatenating strings for model input.

Morphological opacity: "dog" and "dogs" are often different tokens. The model learns their relationship from co-occurrence statistics rather than explicit morphological structure. Languages with rich inflection are particularly affected.

Number tokenization: "1000000" may tokenize as ("100", "000", "0") or ("1", "000", "000"). This makes arithmetic and number comparison difficult for token-level models.

Capitalization: "GPU" and "gpu" are different tokens. For named entities this is meaningful signal — but the model must independently learn both forms.

Fertility disparity across languages: Since training often samples by token rather than by document, high-fertility languages receive fewer document-level gradient updates per step, contributing to multilingual performance gaps.

💡The compression view of tokenization

BPE is fundamentally a compression algorithm. Each merge replaces a two-symbol pattern with a single symbol, reducing total sequence length. A good vocabulary compresses the training corpus efficiently — frequent patterns get single tokens, rare patterns decompose. Vocabulary size is the compression ratio parameter: larger vocabulary means higher compression (shorter sequences) at the cost of more parameters in the embedding table and sparser gradient updates for rare tokens.
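The compression view can be made concrete: each merge shortens the corpus by exactly the merged pair's count. A toy trainer on the lesson's four-word corpus (with a deterministic tie-break of our own — lexicographically smallest among the most frequent — which may order tied merges differently from the walkthrough) shows the total symbol count falling with every merge:

```python
from collections import Counter

def train_bpe(corpus, n_merges):
    """Greedy BPE over {symbol-tuple: freq}; returns corpus length after each merge."""
    lengths = [sum(len(w) * f for w, f in corpus.items())]
    for _ in range(n_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for p in zip(word, word[1:]):
                pairs[p] += freq
        top = max(pairs.values())
        a, b = min(p for p in pairs if pairs[p] == top)  # deterministic tie-break
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        corpus = merged
        lengths.append(sum(len(w) * f for w, f in corpus.items()))
    return lengths

corpus = {("l", "o", "w", "</w>"): 1, ("l", "o", "w", "e", "r", "</w>"): 1,
          ("n", "e", "w", "e", "s", "t", "</w>"): 1, ("w", "i", "d", "e", "s", "t", "</w>"): 1}
print(train_bpe(corpus, 4))  # each merge has count 2, so length drops by 2 each step
```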

⚠️Never preprocess text before a byte-level tokenizer

Byte-level tokenizers (tiktoken, SentencePiece with byte fallback) handle arbitrary Unicode natively. Stripping punctuation, lowercasing, or removing HTML before passing text to these tokenizers alters the token sequence in ways that may not match the distribution the model was trained on. For LLM APIs, pass raw (or minimally cleaned) text and let the tokenizer handle it.
