
Sequence Labeling

Sequence labeling assigns a tag to every token in a sequence. Named entity recognition (NER), part-of-speech (POS) tagging, and chunking are all sequence labeling tasks. Unlike classification, labels are not independent — the tag of a token depends on its neighbors. This dependency structure is what separates the HMM and CRF from classifiers applied token-by-token.

Theory

[Interactive lattice figure: candidate tags B-ORG, I-ORG, B-LOC, I-LOC, B-PER, O for each token of "Apple was founded in California"; the highlighted Viterbi path B-ORG, O, O, O, B-LOC selects the globally optimal tag sequence.]

Viterbi: O(T × |tags|²) dynamic programming. CRF adds learned transition weights between adjacent tags (e.g., I-ORG cannot follow O).

The lattice above shows all possible tag assignments for "Apple was founded in California," with the Viterbi path marking the globally optimal sequence found by dynamic programming. The key insight: the CRF evaluates all paths simultaneously and returns the single globally optimal sequence, not a greedy left-to-right choice that can't correct early mistakes.

Problem Formulation

Given token sequence x = (x₁, x₂, …, xₙ), predict label sequence y = (y₁, y₂, …, yₙ). The label set follows the BIO scheme:

  • B-TYPE — beginning of an entity of TYPE
  • I-TYPE — inside an entity of TYPE
  • O — outside any entity

BIO encoding enforces structural constraints: I-ORG cannot follow O (an I-TYPE tag must be preceded by a B-TYPE or I-TYPE of the same type). The CRF learns to respect these constraints through transition weights.
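Since this constraint is purely mechanical, it can also be checked post hoc; a minimal sketch (the helper name is ours, not from any library):

```python
def is_valid_bio(tags):
    """Return True iff every I-TYPE tag follows B-TYPE or I-TYPE of the same type."""
    prev = 'O'
    for tag in tags:
        if tag.startswith('I-') and prev not in (f'B-{tag[2:]}', f'I-{tag[2:]}'):
            return False
        prev = tag
    return True

print(is_valid_bio(['B-ORG', 'I-ORG', 'O', 'B-LOC']))  # True
print(is_valid_bio(['O', 'I-ORG']))                    # False: I-ORG cannot follow O
```

A CRF learns a large negative transition weight for such pairs instead of hard-coding the rule.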

Hidden Markov Model (HMM)

The HMM models the joint probability of observations and labels as a product of transition and emission probabilities:

P(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{n} P(y_t \mid y_{t-1}) \cdot P(x_t \mid y_t)

  • Transition probability P(y_t \mid y_{t-1}): probability of moving from tag y_{t-1} to y_t, estimated from labeled corpus counts
  • Emission probability P(x_t \mid y_t): probability of word x_t given tag y_t

Decoding uses the Viterbi algorithm: dynamic programming over the lattice of (token, tag) pairs, O(n \cdot |T|^2) where |T| is the number of tags.
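The recursion can be sketched in plain Python over toy probability tables (all numbers below are made up for illustration):

```python
import math

def viterbi(tokens, tags, start, trans, emit):
    """Viterbi decoding: O(n * |T|^2) DP over the (token, tag) lattice.

    start[t] = P(t at position 0), trans[(a, b)] = P(b | a),
    emit[(t, w)] = P(w | t). Log space avoids underflow on long sentences.
    """
    def lg(p):
        return math.log(p) if p > 0 else float('-inf')

    # delta[t] = best log-probability of any path ending in tag t
    delta = {t: lg(start.get(t, 0)) + lg(emit.get((t, tokens[0]), 0)) for t in tags}
    backptrs = []
    for word in tokens[1:]:
        new, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: delta[p] + lg(trans.get((p, t), 0)))
            new[t] = delta[prev] + lg(trans.get((prev, t), 0)) + lg(emit.get((t, word), 0))
            ptr[t] = prev
        delta = new
        backptrs.append(ptr)
    # walk backpointers from the best final tag
    path = [max(tags, key=delta.get)]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy two-tag model
tags = ['B-LOC', 'O']
start = {'O': 0.7, 'B-LOC': 0.3}
trans = {('O', 'O'): 0.8, ('O', 'B-LOC'): 0.2, ('B-LOC', 'O'): 0.9, ('B-LOC', 'B-LOC'): 0.1}
emit = {('O', 'in'): 0.5, ('O', 'Paris'): 0.01, ('B-LOC', 'in'): 0.01, ('B-LOC', 'Paris'): 0.4}

print(viterbi(['in', 'Paris'], tags, start, trans, emit))  # ['O', 'B-LOC']
```

Note that the score of each path factors per position, which is exactly what lets the inner max be computed tag-by-tag instead of enumerating all |T|^n paths.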

The HMM had to be a generative model (modeling P(x, y)) rather than a discriminative one because it was designed before efficient training of large conditional models. The generative factorization makes parameter estimation from corpus counts straightforward — but it also means the model wastes capacity modeling P(x), which is irrelevant for prediction.

Linear-Chain CRF

The Conditional Random Field (Lafferty et al., 2001) directly models the conditional distribution P(y | x):

P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left(\sum_{t=1}^{n} \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\right)

where:

  • f_k(y_{t-1}, y_t, \mathbf{x}, t) are feature functions (arbitrary functions of the full input sequence, current tag, and previous tag)
  • \lambda_k are learned weights
  • Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp(\cdots) is the partition function (normalizer), computed efficiently with the forward algorithm
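The forward computation for Z(x) can be sketched in log space, with a hypothetical score(prev_tag, tag, t) standing in for the weighted feature sum:

```python
import math

def logsumexp(xs):
    m = max(xs)
    if m == float('-inf'):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_partition(n, tags, score):
    """Forward algorithm: log Z(x) over all |tags|^n paths in O(n * |tags|^2).

    score(prev_tag, tag, t) plays the role of sum_k lambda_k f_k(prev_tag, tag, x, t);
    prev_tag is None at t = 0. Both names are illustrative, not library API.
    """
    # alpha[t] = log of the summed exp-score over all prefixes ending in tag t
    alpha = {t: score(None, t, 0) for t in tags}
    for pos in range(1, n):
        alpha = {t: logsumexp([alpha[p] + score(p, t, pos) for p in tags]) for t in tags}
    return logsumexp(list(alpha.values()))

# Sanity check: with all scores 0, Z = |tags|^n (here 2^3 = 8)
assert abs(log_partition(3, ['A', 'B'], lambda p, t, i: 0.0) - math.log(8)) < 1e-12
```

Replacing logsumexp with max turns this into Viterbi, which is why training (forward-backward) and decoding share the same lattice structure.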

Feature functions in a linear-chain CRF can capture rich context: whether the current word is capitalized, whether it ends in "-tion", whether the previous word is a title, etc. The HMM emission P(x_t \mid y_t) can only condition on the current token — the CRF feature function can condition on the entire sentence.

The CRF had to use a conditional model (P(y|x) directly) rather than a generative model (P(x,y)) because NLP feature engineering produces overlapping, non-independent features (e.g., word identity AND word suffix AND previous word capitalization all for the same position). Generative models require independence assumptions among features; CRFs impose no such constraint, which is why they dominate HMMs on practical NER benchmarks.

Training: maximize log-likelihood with L2 regularization, computed via the forward-backward algorithm. Decoding: Viterbi algorithm, same as HMM.

Neural Sequence Labeling

Modern NER stacks a CRF on top of a neural encoder:

BiLSTM-CRF (Lample et al., 2016): Bidirectional LSTM produces contextual embeddings for each token; a linear-chain CRF decodes the optimal tag sequence.

BERT + Linear CRF (current standard): BERT subword embeddings → per-token linear projection → CRF decode. The CRF layer adds only |T|^2 transition parameters but enforces global tag consistency.

Walkthrough

Task: NER on English news. Tags: B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, O.

Step 1 — Feature engineering for sklearn-crfsuite:

python
def word2features(sent, i):
    """Feature function for the token at position i; sent is a list of (word, tag) pairs."""
    word = sent[i][0]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],          # suffix: "-tion", "-ing"
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(), # capitalized: "Apple" vs "apple"
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        prev = sent[i-1][0]
        features.update({
            '-1:word.lower()': prev.lower(),
            '-1:word.istitle()': prev.istitle(),
        })
    else:
        features['BOS'] = True  # beginning of sentence

    if i < len(sent) - 1:
        nxt = sent[i+1][0]
        features.update({
            '+1:word.lower()': nxt.lower(),
            '+1:word.istitle()': nxt.istitle(),
        })
    else:
        features['EOS'] = True  # end of sentence

    return features
 
def sent2features(sent): return [word2features(sent, i) for i in range(len(sent))]
def sent2labels(sent):   return [tag for _, tag in sent]

Step 2 — Train a CRF with sklearn-crfsuite:

python
import sklearn_crfsuite
from sklearn_crfsuite import metrics
 
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,   # L1 regularization (encourages sparse features)
    c2=0.1,   # L2 regularization
    max_iterations=100,
    all_possible_transitions=True,  # learn transitions for unseen tag pairs
)
 
# train_sents: list of [(word, tag), ...] per sentence
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]
 
crf.fit(X_train, y_train)

Step 3 — Evaluate with seqeval (span-level F1):

python
from seqeval.metrics import classification_report, f1_score
 
X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
y_pred = crf.predict(X_test)
 
# seqeval: entity-level evaluation, not token-level
# "Apple Inc" must be labeled [B-ORG, I-ORG] exactly — partial credit is 0
print(classification_report(y_test, y_pred))
print(f"Entity F1: {f1_score(y_test, y_pred):.4f}")

Typical CoNLL-2003 results:

| Model | Entity F1 |
| --- | --- |
| Rule-based gazetteers | ~75% |
| HMM | ~83% |
| CRF (hand features) | ~89% |
| BiLSTM-CRF | ~90.9% |
| BERT + CRF | ~92.8% |

Step 4 — BERT-based NER with spaCy or Hugging Face:

python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
 
# Hugging Face NER pipeline (BERT fine-tuned on CoNLL-2003)
ner = pipeline(
    'ner',
    model='dslim/bert-base-NER',
    aggregation_strategy='simple',  # merge B- and I- spans
)
 
result = ner("Apple was founded in California by Steve Jobs.")
for ent in result:
    print(f"{ent['word']:20s} {ent['entity_group']:8s} {ent['score']:.3f}")
# Apple                ORG      0.998
# California           LOC      0.997
# Steve Jobs           PER      0.999

Analysis & Evaluation

Where Your Intuition Breaks

Intuition: token-level accuracy is a good metric for NER. Reality: it severely over-reports performance, because the majority of tokens are tagged O and predicting O everywhere achieves high token accuracy. Entity-level F1 (as computed by seqeval) is the standard: an entity is correct only if all its tokens receive the correct tag. Partially labeling "Steve Jobs" as [B-PER, O] instead of [B-PER, I-PER] counts as a complete miss in entity F1, even though it gets one token right. For tasks where every missed entity has a cost (information extraction, clinical NLP), entity-level F1 is the only metric that matters.
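The gap is easy to reproduce on a toy example; this pure-Python sketch mirrors the exact-span comparison seqeval performs (the spans helper is ours):

```python
def spans(tags):
    """Extract (start, end, type) entity spans from a BIO sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ['O']):  # sentinel flushes the last span
        if tag.startswith('B-') or tag == 'O' or (tag.startswith('I-') and tag[2:] != etype):
            if start is not None:
                out.append((start, i, etype))
            start, etype = (i, tag[2:]) if tag.startswith('B-') else (None, None)
    return out

gold = ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O']
pred = ['B-PER', 'O',     'O', 'O', 'O', 'O', 'O', 'O']  # truncated "Steve Jobs"

token_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
matches = set(spans(gold)) & set(spans(pred))
print(f"token accuracy = {token_acc:.3f}")  # 0.875: looks fine
print(f"entity matches = {len(matches)}")   # 0: (0, 2, 'PER') != (0, 1, 'PER')
```

One wrong token out of eight yields 87.5% token accuracy but 0.0 entity F1, since the predicted span boundaries do not match the gold span exactly.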

Evaluation: Token vs Entity Level

| Metric | What it measures | When to use |
| --- | --- | --- |
| Token accuracy | Per-token tag correctness | Never (dominated by O tokens) |
| Token F1 (micro) | Weighted by token frequency | Rarely |
| Entity F1 (seqeval) | Exact span match | Standard for NER |
| Partial entity credit | Overlap between predicted and gold spans | Information retrieval tasks |

Sequence Labeling Model Comparison

| Model | Context window | Feature learning | Decoding | Best for |
| --- | --- | --- | --- | --- |
| HMM | Bigram (one previous tag) | No (counts) | Viterbi | Baseline, fast |
| CRF | Full sequence (features of full x) | No (hand features) | Viterbi | Small data, interpretable |
| BiLSTM-CRF | Bidirectional RNN context | Yes | CRF Viterbi | Mid-size datasets |
| BERT + CRF | Full transformer context | Yes (pretrained) | CRF Viterbi | Large datasets, SOTA |

Common Pitfalls

Ignoring the BIO consistency constraint. A classifier applied token-by-token can predict I-ORG after O, which is structurally invalid. The CRF transition matrix (or a simple BIO consistency check post-hoc) prevents this. Without it, entity spans can fragment.

Using token-level accuracy. Report entity F1 (seqeval). Token accuracy is almost always misleading.

Subword tokenization and BERT. BERT tokenizes "California" as one token but "Wolfsberg" might become ["Wolf", "##sberg"]. For NER, only the first subword of each word gets a label — trailing subwords are ignored during evaluation. Use is_split_into_words=True and the word_ids() method from the HuggingFace tokenizer to align predictions back to words.
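The alignment step itself is only a few lines; a sketch assuming a word_ids list of the shape word_ids() returns (None for special tokens like [CLS]/[SEP]):

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's first subword its label; mask specials and trailing subwords.

    word_ids maps each subword to its word index, as produced by
    tokenizer(..., is_split_into_words=True) followed by .word_ids().
    ignore_index = -100 is the value PyTorch cross-entropy skips by default.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(ignore_index)  # special token or trailing subword
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# "Wolfsberg" -> ["Wolf", "##sberg"]: only "Wolf" keeps the B-LOC label
word_ids = [None, 0, 1, 1, None]  # [CLS] in Wolf ##sberg [SEP]
print(align_labels(word_ids, ['O', 'B-LOC']))  # [-100, 'O', 'B-LOC', -100, -100]
```

The same word_ids mapping is used in reverse at prediction time to project per-subword tags back onto whole words.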

Cross-domain generalization. NER models trained on news fail on clinical text, legal documents, or social media. Domain-specific entities (drug names, legal terms, informal abbreviations) require domain-adapted training data or fine-tuning.
