N-gram Language Models
A language model assigns a probability to every sequence of words. Before neural networks, the dominant approach was the n-gram model: estimate the probability of the next word from the preceding n−1 words, using counts from a large text corpus. Understanding n-gram models is necessary groundwork for evaluating any language model — perplexity, the standard LM metric, is defined in terms of the probability a model assigns to text.
Theory
A unigram model treats each word independently; a bigram model conditions on the previous word; a trigram model on the two previous words. Longer contexts give more precise predictions but suffer from sparsity: most trigrams never appear in training data.
The chain rule decomposes the probability of any sequence exactly:

P(w1 … wn) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) · … · P(wn | w1 … wn−1)
The Markov assumption — that the next word depends only on the preceding k words — approximates this tractably. For a bigram (k = 1):

P(wi | w1 … wi−1) ≈ P(wi | wi−1)
The Markov assumption truncates the conditioning context because exact conditioning on the full history makes estimation impossible: the number of distinct word sequences grows exponentially with length, so most long histories never appear in any finite corpus. Even bigram and trigram counts are sparse enough to require smoothing; conditioning on 10 previous words would be completely infeasible. The assumption is wrong (words depend on arbitrarily distant context), but it trades that correctness for a model whose parameters can actually be estimated from data.
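To make the sparsity concrete, here is a small sketch (toy corpus, illustrative only) that counts, for each order n, how many distinct n-grams occur and how many occur exactly once:

```python
from collections import Counter

def ngram_sparsity(tokens: list[str], max_n: int = 4) -> dict[int, tuple[int, int]]:
    """For each order n, return (distinct n-grams, n-grams seen exactly once)."""
    stats = {}
    for n in range(1, max_n + 1):
        grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        singletons = sum(1 for c in grams.values() if c == 1)
        stats[n] = (len(grams), singletons)
    return stats

tokens = "the cat sat on the mat the dog sat on the rug".split()
for n, (distinct, once) in ngram_sparsity(tokens).items():
    print(f"{n}-grams: {distinct} distinct, {once} seen only once")
```

Even in this 12-token corpus, every 4-gram is a singleton — exactly the regime where counts alone stop being usable.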
Maximum Likelihood Estimation
N-gram probabilities are estimated directly from corpus counts:

P(wi | wi−1) = C(wi−1 wi) / C(wi−1)
where C(·) denotes count in the training corpus. MLE is exact but fragile: any n-gram not seen in training gets probability zero, making the probability of any unseen test sentence zero.
Smoothing
Smoothing redistributes some probability mass from seen n-grams to unseen ones.
Laplace (add-1) smoothing: add 1 to every count before normalizing:

P(wi | wi−1) = (C(wi−1 wi) + 1) / (C(wi−1) + |V|)
where |V| is vocabulary size. Simple but over-smooths — too much mass goes to unseen events.
Kneser-Ney smoothing: the dominant practical method. It discounts counts by a fixed amount δ ∈ (0,1) and redistributes to a lower-order model weighted by continuation probability — how many distinct contexts a word appears in, not how often it appears:

P(wi | wi−1) = max(C(wi−1 wi) − δ, 0) / C(wi−1) + λ(wi−1) · Pcontinuation(wi)

where Pcontinuation(wi) is proportional to the number of distinct words that precede wi, and λ(wi−1) scales the discounted mass so the distribution normalizes.
The continuation probability captures versatility: "Francisco" appears often but almost always after "San," so it gets low continuation probability and low backoff weight. Kneser-Ney is the standard baseline that n-gram systems are judged against.
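As a rough sketch of how the pieces fit together, here is a minimal interpolated Kneser-Ney bigram model with a single discount; `kneser_ney_bigram` and the fixed δ = 0.75 are illustrative choices, not a production implementation:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(sentences: list[list[str]], delta: float = 0.75):
    """Interpolated Kneser-Ney bigram probabilities with one discount delta.
    Assumes the conditioning word w1 was seen in training (sketch only)."""
    bigrams = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        bigrams.update(zip(tokens, tokens[1:]))

    context_total = Counter()     # C(w1 ·): total bigram count starting with w1
    followers = defaultdict(set)  # distinct words seen after w1
    histories = defaultdict(set)  # distinct words seen before w2
    for (w1, w2), c in bigrams.items():
        context_total[w1] += c
        followers[w1].add(w2)
        histories[w2].add(w1)
    num_bigram_types = len(bigrams)

    def prob(w2: str, w1: str) -> float:
        discounted = max(bigrams[(w1, w2)] - delta, 0) / context_total[w1]
        lam = delta * len(followers[w1]) / context_total[w1]  # leftover mass
        p_cont = len(histories[w2]) / num_bigram_types        # versatility of w2
        return discounted + lam * p_cont

    return prob

train = [
    ["add", "salt", "to", "the", "water"],
    ["add", "pepper", "to", "the", "sauce"],
    ["stir", "the", "water", "slowly"],
    ["stir", "the", "sauce", "gently"],
]
p = kneser_ney_bigram(train)
print(p("water", "the"))     # seen bigram: discounted count plus backoff mass
print(p("water", "pepper"))  # unseen bigram: continuation mass only, nonzero
```

Note that the unseen bigram gets a small but nonzero probability, weighted by how many distinct contexts "water" has appeared in.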
Perplexity
Perplexity measures how well a language model predicts a held-out test corpus — lower is better:

PP(W) = P(w1 w2 … wN)^(−1/N) = exp(−(1/N) · Σi log P(wi | wi−1))
Perplexity is the geometric mean inverse probability per word — equivalently, the weighted average branching factor the model considers at each step. A perplexity of 100 means the model is, on average, as uncertain as if it chose uniformly over 100 equally likely words at each position.
Perplexity is defined as a geometric mean over the test sequence (not a raw sum of log-probabilities) so that it is comparable across sequences of different lengths: a raw log-probability sum grows with sequence length and cannot be compared across test sets. Normalizing by N makes perplexity a per-word property of the model's predictions, independent of test-set size.
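A quick numeric check of why the normalization matters, assuming a toy model that assigns every word probability 0.1:

```python
import math

def perplexity(log_probs: list[float]) -> float:
    """Per-word perplexity from a list of per-token log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Same model, test sets of very different sizes
short = [math.log(0.1)] * 10     # raw log-prob sum ≈ -23.0
long = [math.log(0.1)] * 1000    # raw log-prob sum ≈ -2302.6

print(sum(short), sum(long))               # raw sums differ by 100x
print(perplexity(short), perplexity(long)) # both ≈ 10.0: comparable
```

The raw sums are incomparable, but per-word perplexity is 10 in both cases — matching the branching-factor reading above.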
Walkthrough
Corpus: 4 sentences from a toy recipe domain.
```
<s> add salt to the water </s>
<s> add pepper to the sauce </s>
<s> stir the water slowly </s>
<s> stir the sauce gently </s>
```
Step 1 — Count bigrams:
| Bigram | Count |
|---|---|
| <s> add | 2 |
| add salt | 1 |
| add pepper | 1 |
| to the | 2 |
| the water | 2 |
| the sauce | 2 |
| stir the | 2 |
Step 2 — Compute MLE bigram probabilities:
P(salt | add) = 1/2, P(pepper | add) = 1/2
P(water | the) = 2/4 = 0.5, P(sauce | the) = 2/4 = 0.5
Step 3 — Score a new sentence:
"add pepper to the water"
P(to | pepper) = 0 (never seen in training) → entire probability = 0.
This is the zero-probability problem. With Laplace smoothing and |V| = 12 (the 12 distinct tokens in the corpus, counting <s> and </s>), P(to | pepper) = (0 + 1) / (1 + 12) = 1/13 ≈ 0.077: small, but no longer zero.
```python
from collections import Counter
import math


def build_bigram_model(sentences: list[list[str]], delta: float = 0.0):
    """MLE bigram model with optional Laplace smoothing (delta=1 for add-1)."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w in tokens:
            unigrams[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
    vocab_size = len(unigrams)

    def prob(w2: str, w1: str) -> float:
        num = bigrams[(w1, w2)] + delta
        den = unigrams[w1] + delta * vocab_size
        return num / den if den > 0 else 0.0

    return prob


def perplexity(prob_fn, test_sentences: list[list[str]]) -> float:
    total_log_prob = 0.0
    total_tokens = 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            p = prob_fn(w2, w1)
            total_log_prob += math.log(p) if p > 0 else float("-inf")
            total_tokens += 1
    return math.exp(-total_log_prob / total_tokens)


# Build and evaluate
train = [
    ["add", "salt", "to", "the", "water"],
    ["add", "pepper", "to", "the", "sauce"],
    ["stir", "the", "water", "slowly"],
    ["stir", "the", "sauce", "gently"],
]
test = [["add", "pepper", "to", "the", "water"]]

mle_model = build_bigram_model(train, delta=0.0)
smooth_model = build_bigram_model(train, delta=1.0)
print(f"MLE perplexity: {perplexity(mle_model, test)}")             # inf (zero prob)
print(f"Laplace perplexity: {perplexity(smooth_model, test):.2f}")  # finite
```

Analysis & Evaluation
Where Your Intuition Breaks
A common trap: assuming that perplexity improvements on a development set always translate to better downstream task performance. They do not. Perplexity measures how well a model predicts the next word in a corpus, which is not the same as how well it performs on a downstream task like question answering or summarization. A model can achieve lower perplexity by over-fitting the genre or domain of the evaluation corpus (e.g., a model trained and evaluated on news text looks excellent on perplexity but fails on conversational text). Neural LMs consistently achieve lower perplexity than n-gram models on standard benchmarks, but in narrow, specialized domains with small corpora (medical transcripts, legal documents, code in a specific framework), n-gram models often remain competitive: the n-gram distribution closely matches the test domain, and there is too little data to train a neural model well. Perplexity is a good intrinsic metric but an incomplete proxy for task performance.
N-gram Order Trade-offs
| Order | Sparsity | Context captured | Smoothing burden |
|---|---|---|---|
| Unigram | None | None (word frequency only) | Minimal |
| Bigram | Low | One previous word | Low |
| Trigram | High | Two previous words | Medium |
| 4-gram+ | Very high | Longer phrases | High (Kneser-Ney required) |
Trigram models with Kneser-Ney smoothing were the practical optimum for pre-neural LMs. Higher orders provide diminishing returns because of sparsity.
N-gram LMs vs Neural LMs
| Property | N-gram | Neural (LSTM/Transformer) |
|---|---|---|
| Context window | Fixed, small (2-5 words) | Long (hundreds to thousands) |
| Parameter count | Vocabulary-dependent, can be huge | Fixed architecture, smaller in practice |
| Training data needed | Small (millions of tokens) | Large (billions of tokens) |
| Inference speed | Very fast (table lookup) | Slower (matrix multiplication) |
| Out-of-vocabulary words | Problematic (zero counts) | Handled via subword tokenization |
| Interpretability | High (counts are inspectable) | Low |
N-gram models remain useful as baselines and for data augmentation sanity checks: a sentence that receives low probability under a trigram model is likely ungrammatical or domain-mismatched, regardless of what a neural model says.
Key Applications
Perplexity as a training signal: N-gram perplexity on a small, high-quality reference corpus is used as a data quality filter in pre-training pipelines — documents with perplexity far above or below the reference corpus are likely spam or templated content and are discarded.
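One way such a filter might be sketched — `filter_by_perplexity`, the thresholds, and the toy scorer below are all hypothetical, standing in for a real reference LM and empirically tuned cutoffs:

```python
def filter_by_perplexity(docs: list[str], ppl_fn, low: float = 20.0,
                         high: float = 5000.0) -> list[str]:
    """Keep documents whose perplexity under a reference LM falls in a
    plausible band; the thresholds here are illustrative, not standard values."""
    kept = []
    for doc in docs:
        ppl = ppl_fn(doc)
        # far below the band: likely templated/repetitive; far above: likely noise
        if low <= ppl <= high:
            kept.append(doc)
    return kept

# Toy scorer standing in for a real reference LM (hypothetical values)
scores = {"good prose": 300.0, "aaa aaa aaa": 4.0, "x9$#kq zzv": 90000.0}
docs = list(scores)
print(filter_by_perplexity(docs, scores.get))  # → ['good prose']
```

Both tails are suspicious: very low perplexity usually means boilerplate the reference LM has effectively memorized, very high usually means non-language content.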
Speech recognition rescoring: N-gram LMs are fast enough to rescore the beam during decoding. Hybrid systems combine acoustic models with n-gram LMs for efficiency.
Machine translation prefix scoring: N-gram models over the target language prune unlikely partial translations early in beam search.