N-gram Language Models
A language model assigns a probability to every sequence of words. Before neural networks, the dominant approach was the n-gram model: estimate the probability of the next word from the preceding n−1 words, using counts from a large text corpus. Understanding n-gram models is necessary groundwork for evaluating any language model — perplexity, the standard LM metric, is defined in terms of the probability a model assigns to text.
Theory
A unigram model treats each word independently; a bigram model conditions on the previous word; a trigram model on the two previous words. Longer contexts give more precise predictions but suffer from sparsity: most trigrams never appear in training data.
The chain rule decomposes the probability of any sequence exactly:

P(w1 … wn) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) · … · P(wn | w1 … wn−1)
The Markov assumption — that the next word depends only on the preceding k words — approximates this tractably. For a bigram (k = 1):

P(wi | w1 … wi−1) ≈ P(wi | wi−1)
The Markov assumption truncates the conditioning context because exact conditioning on the full history makes estimation impossible: the number of distinct word sequences grows exponentially with length, so most long histories never appear in any finite corpus. Even bigram and trigram counts are sparse enough to require smoothing; conditioning on 10 previous words would be completely infeasible. The assumption is wrong (words depend on arbitrarily distant context), but it trades that correctness for a model whose parameters can actually be estimated from data.
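To make the sparsity concrete, here is a small sketch (toy corpus, illustrative only) that counts, for each order n, how many distinct n-grams occur and how many occur exactly once:

```python
from collections import Counter

def ngram_sparsity(tokens: list[str], max_n: int = 4) -> dict[int, tuple[int, int]]:
    """For each order n, return (distinct n-grams, n-grams seen exactly once)."""
    stats = {}
    for n in range(1, max_n + 1):
        grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        singletons = sum(1 for c in grams.values() if c == 1)
        stats[n] = (len(grams), singletons)
    return stats

tokens = "the cat sat on the mat the dog sat on the rug".split()
for n, (distinct, once) in ngram_sparsity(tokens).items():
    print(f"{n}-grams: {distinct} distinct, {once} seen only once")
```

Even in this 12-token corpus, every 4-gram is a singleton — exactly the regime where counts alone stop being usable.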
Maximum Likelihood Estimation
N-gram probabilities are estimated directly from corpus counts:

P(wi | wi−1) = C(wi−1 wi) / C(wi−1)
where C(·) denotes count in the training corpus. MLE is exact but fragile: any n-gram not seen in training gets probability zero, making the probability of any unseen test sentence zero.
Smoothing
Smoothing redistributes some probability mass from seen n-grams to unseen ones.
Laplace (add-1) smoothing: add 1 to every count before normalizing:

P(wi | wi−1) = (C(wi−1 wi) + 1) / (C(wi−1) + |V|)
where |V| is vocabulary size. Simple but over-smooths — too much mass goes to unseen events.
Kneser-Ney smoothing: the dominant practical method. It discounts counts by a fixed amount δ ∈ (0,1) and redistributes to a lower-order model weighted by continuation probability — how many distinct contexts a word appears in, not how often it appears:

P(wi | wi−1) = max(C(wi−1 wi) − δ, 0) / C(wi−1) + λ(wi−1) · Pcontinuation(wi)

where Pcontinuation(wi) is proportional to the number of distinct words that precede wi, and λ(wi−1) scales the discounted mass so the distribution normalizes.
The continuation probability captures versatility: "Francisco" appears often but almost always after "San," so it gets low continuation probability and low backoff weight. Kneser-Ney is the standard baseline that n-gram systems are judged against.
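As a rough sketch of how the pieces fit together, here is a minimal interpolated Kneser-Ney bigram model with a single discount; `kneser_ney_bigram` and the fixed δ = 0.75 are illustrative choices, not a production implementation:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(sentences: list[list[str]], delta: float = 0.75):
    """Interpolated Kneser-Ney bigram probabilities with one discount delta.
    Assumes the conditioning word w1 was seen in training (sketch only)."""
    bigrams = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        bigrams.update(zip(tokens, tokens[1:]))

    context_total = Counter()     # C(w1 ·): total bigram count starting with w1
    followers = defaultdict(set)  # distinct words seen after w1
    histories = defaultdict(set)  # distinct words seen before w2
    for (w1, w2), c in bigrams.items():
        context_total[w1] += c
        followers[w1].add(w2)
        histories[w2].add(w1)
    num_bigram_types = len(bigrams)

    def prob(w2: str, w1: str) -> float:
        discounted = max(bigrams[(w1, w2)] - delta, 0) / context_total[w1]
        lam = delta * len(followers[w1]) / context_total[w1]  # leftover mass
        p_cont = len(histories[w2]) / num_bigram_types        # versatility of w2
        return discounted + lam * p_cont

    return prob

train = [
    ["add", "salt", "to", "the", "water"],
    ["add", "pepper", "to", "the", "sauce"],
    ["stir", "the", "water", "slowly"],
    ["stir", "the", "sauce", "gently"],
]
p = kneser_ney_bigram(train)
print(p("water", "the"))     # seen bigram: discounted count plus backoff mass
print(p("water", "pepper"))  # unseen bigram: continuation mass only, nonzero
```

Note that the unseen bigram gets a small but nonzero probability, weighted by how many distinct contexts "water" has appeared in.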
Perplexity
Perplexity measures how well a language model predicts a held-out test corpus — lower is better:

PP(W) = P(w1 w2 … wN)^(−1/N) = exp(−(1/N) · Σi log P(wi | wi−1))
Perplexity is the geometric mean inverse probability per word — equivalently, the weighted average branching factor the model considers at each step. A perplexity of 100 means the model is, on average, as uncertain as if it chose uniformly over 100 equally likely words at each position.
Perplexity is defined as a geometric mean over the test sequence (not a raw sum of log-probabilities) so that it is comparable across sequences of different lengths: a raw log-probability sum grows with sequence length and cannot be compared across test sets. Normalizing by N makes perplexity a per-word property of the model's predictions, independent of test-set size.
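A quick numeric check of why the normalization matters, assuming a toy model that assigns every word probability 0.1:

```python
import math

def perplexity(log_probs: list[float]) -> float:
    """Per-word perplexity from a list of per-token log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Same model, test sets of very different sizes
short = [math.log(0.1)] * 10     # raw log-prob sum ≈ -23.0
long = [math.log(0.1)] * 1000    # raw log-prob sum ≈ -2302.6

print(sum(short), sum(long))               # raw sums differ by 100x
print(perplexity(short), perplexity(long)) # both ≈ 10.0: comparable
```

The raw sums are incomparable, but per-word perplexity is 10 in both cases — matching the branching-factor reading above.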
Walkthrough
Corpus: 4 sentences from a toy recipe domain.
```
<s> add salt to the water </s>
<s> add pepper to the sauce </s>
<s> stir the water slowly </s>
<s> stir the sauce gently </s>
```
Step 1 — Count bigrams:
| Bigram | Count |
|---|---|
| <s> add | 2 |
| add salt | 1 |
| add pepper | 1 |
| to the | 2 |
| the water | 2 |
| the sauce | 2 |
| stir the | 2 |
Step 2 — Compute MLE bigram probabilities:
P(salt | add) = 1/2, P(pepper | add) = 1/2
P(water | the) = 2/4 = 0.5, P(sauce | the) = 2/4 = 0.5
Step 3 — Score a new sentence:
"add pepper to the water"
P(to | pepper) = 0 (never seen in training) → entire probability = 0.
This is the zero-probability problem. With Laplace smoothing and |V| = 12 (the 12 distinct tokens in the corpus, counting <s> and </s>), P(to | pepper) = (0 + 1) / (1 + 12) = 1/13 ≈ 0.077: small, but no longer zero.
```python
from collections import Counter
import math


def build_bigram_model(sentences: list[list[str]], delta: float = 0.0):
    """MLE bigram model with optional Laplace smoothing (delta=1 for add-1)."""
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w in tokens:
            unigrams[w] += 1
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[(w1, w2)] += 1
    vocab_size = len(unigrams)

    def prob(w2: str, w1: str) -> float:
        num = bigrams[(w1, w2)] + delta
        den = unigrams[w1] + delta * vocab_size
        return num / den if den > 0 else 0.0

    return prob


def perplexity(prob_fn, test_sentences: list[list[str]]) -> float:
    total_log_prob = 0.0
    total_tokens = 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            p = prob_fn(w2, w1)
            total_log_prob += math.log(p) if p > 0 else float("-inf")
            total_tokens += 1
    return math.exp(-total_log_prob / total_tokens)


# Build and evaluate
train = [
    ["add", "salt", "to", "the", "water"],
    ["add", "pepper", "to", "the", "sauce"],
    ["stir", "the", "water", "slowly"],
    ["stir", "the", "sauce", "gently"],
]
test = [["add", "pepper", "to", "the", "water"]]

mle_model = build_bigram_model(train, delta=0.0)
smooth_model = build_bigram_model(train, delta=1.0)
print(f"MLE perplexity: {perplexity(mle_model, test)}")             # inf (zero prob)
print(f"Laplace perplexity: {perplexity(smooth_model, test):.2f}")  # finite
```

Analysis & Evaluation
Where Your Intuition Breaks
A common trap: assuming that perplexity improvements on a development set always translate to better downstream task performance. They do not. Perplexity measures how well a model predicts the next word in a corpus, which is not the same as how well it performs on a downstream task like question answering or summarization. A model can achieve lower perplexity by over-fitting the genre or domain of the evaluation corpus (e.g., a model trained and evaluated on news text looks excellent on perplexity but fails on conversational text). Neural LMs consistently achieve lower perplexity than n-gram models on standard benchmarks, but in narrow, specialized domains with small corpora (medical transcripts, legal documents, code in a specific framework), n-gram models often remain competitive: the n-gram distribution closely matches the test domain, and there is too little data to train a neural model well. Perplexity is a good intrinsic metric but an incomplete proxy for task performance.
N-gram Order Trade-offs
| Order | Sparsity | Context captured | Smoothing burden |
|---|---|---|---|
| Unigram | None | None (word frequency only) | Minimal |
| Bigram | Low | One previous word | Low |
| Trigram | High | Two previous words | Medium |
| 4-gram+ | Very high | Longer phrases | High (Kneser-Ney required) |
Trigram models with Kneser-Ney smoothing were the practical optimum for pre-neural LMs. Higher orders provide diminishing returns because of sparsity.
N-gram LMs vs Neural LMs
| Property | N-gram | Neural (LSTM/Transformer) |
|---|---|---|
| Context window | Fixed, small (2-5 words) | Long (hundreds to thousands) |
| Parameter count | Vocabulary-dependent, can be huge | Fixed architecture, smaller in practice |
| Training data needed | Small (millions of tokens) | Large (billions of tokens) |
| Inference speed | Very fast (table lookup) | Slower (matrix multiplication) |
| Out-of-vocabulary words | Problematic (zero counts) | Handled via subword tokenization |
| Interpretability | High (counts are inspectable) | Low |
N-gram models remain useful as baselines and for data augmentation sanity checks: a sentence that receives low probability under a trigram model is likely ungrammatical or domain-mismatched, regardless of what a neural model says.
Key Applications
Perplexity as a training signal: N-gram perplexity on a small, high-quality reference corpus is used as a data quality filter in pre-training pipelines — documents with perplexity far above or below the reference corpus are likely spam or templated content and are discarded.
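One way such a filter might be sketched — `filter_by_perplexity`, the thresholds, and the toy scorer below are all hypothetical, standing in for a real reference LM and empirically tuned cutoffs:

```python
def filter_by_perplexity(docs: list[str], ppl_fn, low: float = 20.0,
                         high: float = 5000.0) -> list[str]:
    """Keep documents whose perplexity under a reference LM falls in a
    plausible band; the thresholds here are illustrative, not standard values."""
    kept = []
    for doc in docs:
        ppl = ppl_fn(doc)
        # far below the band: likely templated/repetitive; far above: likely noise
        if low <= ppl <= high:
            kept.append(doc)
    return kept

# Toy scorer standing in for a real reference LM (hypothetical values)
scores = {"good prose": 300.0, "aaa aaa aaa": 4.0, "x9$#kq zzv": 90000.0}
docs = list(scores)
print(filter_by_perplexity(docs, scores.get))  # → ['good prose']
```

Both tails are suspicious: very low perplexity usually means boilerplate the reference LM has effectively memorized, very high usually means non-language content.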
Speech recognition rescoring: N-gram LMs are fast enough to rescore the beam during decoding. Hybrid systems combine acoustic models with n-gram LMs for efficiency.
Machine translation prefix scoring: N-gram models over the target language prune unlikely partial translations early in beam search.