Sequence Labeling
Sequence labeling assigns a tag to every token in a sequence. Named entity recognition (NER), part-of-speech (POS) tagging, and chunking are all sequence labeling tasks. Unlike classification, labels are not independent — the tag of a token depends on its neighbors. This dependency structure is what separates the HMM and CRF from classifiers applied token-by-token.
Theory
The tag lattice for "Apple was founded in California" lays out every candidate tag for every token; a path through the lattice is one complete tag assignment. The Viterbi algorithm finds the globally optimal path by dynamic programming. The key insight: the CRF evaluates all paths simultaneously and returns the single globally optimal sequence, not a greedy left-to-right choice that can't correct early mistakes.
Problem Formulation
Given token sequence x = (x₁, x₂, …, xₙ), predict label sequence y = (y₁, y₂, …, yₙ). The label set follows the BIO scheme:
- B-TYPE — beginning of an entity of TYPE
- I-TYPE — inside an entity of TYPE
- O — outside any entity
BIO encoding enforces structural constraints: I-ORG cannot follow O, since an inside tag must be preceded by a B or another I of the same entity type. The CRF learns to respect these constraints through transition weights; a post-hoc validity check is sketched below.
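To make the constraint concrete, here is a minimal validity check (a sketch; is_valid_bio is an illustrative name, not a library function):

def is_valid_bio(tags):
    """Return True if a BIO tag sequence is structurally consistent."""
    prev = 'O'
    for tag in tags:
        if tag.startswith('I-'):
            etype = tag[2:]
            # I-TYPE is only valid after B-TYPE or I-TYPE of the same type
            if prev not in (f'B-{etype}', f'I-{etype}'):
                return False
        prev = tag
    return True

assert is_valid_bio(['B-ORG', 'I-ORG', 'O'])
assert not is_valid_bio(['O', 'I-ORG'])      # I-ORG cannot follow O
assert not is_valid_bio(['B-PER', 'I-ORG'])  # entity type mismatch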
Hidden Markov Model (HMM)
The HMM models the joint probability of observations and labels as a product of transition and emission probabilities:

P(x, y) = ∏ₜ P(yₜ | yₜ₋₁) · P(xₜ | yₜ)

- Transition probability P(yₜ | yₜ₋₁): probability of moving from tag yₜ₋₁ to tag yₜ, estimated from labeled corpus counts
- Emission probability P(xₜ | yₜ): probability of word xₜ given tag yₜ

Decoding uses the Viterbi algorithm: dynamic programming over the n×K lattice of (token, tag) pairs, where K is the number of tags, in O(nK²) time.
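As a concrete illustration, here is a minimal Viterbi decoder over log-probabilities. This is a sketch: it assumes the transition and emission matrices have already been estimated from counts, and all names are illustrative.

import numpy as np

def viterbi(log_trans, log_emit, obs):
    """Most likely tag sequence for a list of token ids `obs`.

    log_trans: (K, K) array, log P(tag_j | tag_i)
    log_emit:  (K, V) array, log P(word | tag)
    """
    n, K = len(obs), log_trans.shape[0]
    score = np.empty((n, K))            # best log-prob of any path ending in tag k at step t
    back = np.zeros((n, K), dtype=int)  # backpointers for path recovery
    score[0] = log_emit[:, obs[0]]      # assumes a uniform start distribution
    for t in range(1, n):
        # cand[i, j] = score of extending a path ending in tag i with tag j
        cand = score[t - 1][:, None] + log_trans + log_emit[None, :, obs[t]]
        back[t] = cand.argmax(axis=0)   # best previous tag for each current tag
        score[t] = cand.max(axis=0)
    path = [int(score[-1].argmax())]    # best final tag, then walk the backpointers
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]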
The HMM had to be a generative model (modeling P(x, y)) rather than a discriminative one because it was designed before efficient training of large conditional models. The generative factorization makes parameter estimation from corpus counts straightforward — but it also means the model wastes capacity modeling P(x), which is irrelevant for prediction.
Linear-Chain CRF
The Conditional Random Field (Lafferty et al., 2001) directly models the conditional distribution P(y | x):

P(y | x) = (1 / Z(x)) · exp(Σₜ Σₖ λₖ fₖ(yₜ₋₁, yₜ, x, t))

where:
- fₖ are feature functions (arbitrary functions of the full input sequence, current tag, and previous tag)
- λₖ are learned weights
- Z(x) is the partition function (normalizer), computed efficiently with the forward algorithm
Feature functions in a linear-chain CRF can capture rich context: whether the current word is capitalized, whether it ends in "-tion", whether the previous word is a title, etc. The HMM emission can only condition on the current token — the CRF feature function can condition on the entire sentence.
The CRF had to use a conditional model (P(y|x) directly) rather than a generative model (P(x,y)) because NLP feature engineering produces overlapping, non-independent features (e.g., word identity AND word suffix AND previous word capitalization all for the same position). Generative models require independence assumptions among features; CRFs impose no such constraint, which is why they dominate HMMs on practical NER benchmarks.
Training: maximize log-likelihood with L2 regularization, computed via the forward-backward algorithm. Decoding: Viterbi algorithm, same as HMM.
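To see why Z(x) is tractable, here is a minimal sketch of the forward recursion in log space; it assumes the weighted features have already been collapsed into per-token score matrices (names illustrative):

import numpy as np
from scipy.special import logsumexp

def log_partition(unary, trans):
    """log Z(x) for a linear-chain CRF.

    unary: (n, K) per-token tag scores (weighted state features)
    trans: (K, K) scores for tag i -> tag j transitions
    """
    alpha = unary[0]  # log-sum over all length-1 paths ending in each tag
    for t in range(1, len(unary)):
        # logsumexp over the previous tag, for each current tag
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return float(logsumexp(alpha))

Replacing logsumexp with max turns the same recursion into Viterbi decoding, which is why training and decoding share the same lattice structure.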
Neural Sequence Labeling
Modern NER stacks a CRF on top of a neural encoder:
BiLSTM-CRF (Lample et al., 2016): Bidirectional LSTM produces contextual embeddings for each token; a linear-chain CRF decodes the optimal tag sequence.
BERT + Linear CRF (current standard): BERT subword embeddings → per-token linear projection → CRF decode. The CRF layer adds only transition parameters but enforces global tag consistency.
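As a sketch of this stack (not a full training script), one common implementation uses the third-party pytorch-crf package for the CRF layer; the class name and hyperparameters below are illustrative:

import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf (assumed third-party dependency)

class BertCrfTagger(nn.Module):
    def __init__(self, num_tags, model_name='bert-base-cased'):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)  # adds only transition parameters

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.proj(hidden)   # (batch, seq_len, num_tags) per-token scores
        mask = attention_mask.bool()
        if tags is not None:            # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask, reduction='mean')
        return self.crf.decode(emissions, mask=mask)  # inference: Viterbi paths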
Walkthrough
Task: NER on English news. Tags: B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, O.
Step 1 — Feature engineering for sklearn-crfsuite:
def word2features(sent, i):
    """Feature function for token at position i."""
    word = sent[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],            # suffix: "-tion", "-ing"
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),  # capitalized: "Apple" vs "apple"
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        prev = sent[i - 1]
        features.update({
            '-1:word.lower()': prev.lower(),
            '-1:word.istitle()': prev.istitle(),
        })
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sent) - 1:
        nxt = sent[i + 1]
        features.update({
            '+1:word.lower()': nxt.lower(),
            '+1:word.istitle()': nxt.istitle(),
        })
    else:
        features['EOS'] = True  # end of sentence
    return features

def sent2features(sent): return [word2features(sent, i) for i in range(len(sent))]  # sent: list of words
def sent2labels(sent): return [tag for _, tag in sent]  # sent: list of (word, tag) pairs

Step 2 — Train a CRF with sklearn-crfsuite:
import sklearn_crfsuite
from sklearn_crfsuite import metrics
crf = sklearn_crfsuite.CRF(
algorithm='lbfgs',
c1=0.1, # L1 regularization (encourages sparse features)
c2=0.1, # L2 regularization
max_iterations=100,
all_possible_transitions=True, # learn transitions for unseen tag pairs
)
# train_sents: list of [(word, tag), ...] per sentence
X_train = [sent2features([w for w, _ in s]) for s in train_sents]  # features are built from words only
y_train = [sent2labels(s) for s in train_sents]
crf.fit(X_train, y_train)

Step 3 — Evaluate with seqeval (span-level F1):
from seqeval.metrics import classification_report, f1_score
X_test = [sent2features([w for w, _ in s]) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
y_pred = crf.predict(X_test)
# seqeval: entity-level evaluation, not token-level
# "Apple Inc" must be labeled [B-ORG, I-ORG] exactly — partial credit is 0
print(classification_report(y_test, y_pred))
print(f"Entity F1: {f1_score(y_test, y_pred):.4f}")Typical CoNLL-2003 results:
| Model | Entity F1 |
|---|---|
| Rule-based gazetteers | ~75% |
| HMM | ~83% |
| CRF (hand features) | ~89% |
| BiLSTM-CRF | ~90.9% |
| BERT + CRF | ~92.8% |
Step 4 — BERT-based NER with the Hugging Face pipeline:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
# Hugging Face NER pipeline (BERT fine-tuned on CoNLL-2003)
ner = pipeline(
'ner',
model='dslim/bert-base-NER',
aggregation_strategy='simple', # merge B- and I- spans
)
result = ner("Apple was founded in California by Steve Jobs.")
for ent in result:
print(f"{ent['word']:20s} {ent['entity_group']:8s} {ent['score']:.3f}")
# Apple ORG 0.998
# California LOC 0.997
# Steve Jobs PER 0.999

Analysis & Evaluation
Where Your Intuition Breaks
Intuition says token-level accuracy is a good metric for NER. It is not: token accuracy severely over-reports performance, because the majority of tokens are tagged O and predicting O everywhere already scores high. Entity-level F1 (as computed by seqeval) is the standard: an entity is correct only if all of its tokens receive the correct tag. Labeling "Steve Jobs" as [B-PER, O] instead of [B-PER, I-PER] counts as a complete miss in entity F1, even though one token is right. For tasks where every missed entity has a cost (information extraction, clinical NLP), entity-level F1 is the metric that matters. The sketch below makes the gap concrete.
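A tiny worked example (illustrative tag sequences) shows how far the two metrics can diverge:

from seqeval.metrics import f1_score

y_true = [['B-PER', 'I-PER', 'O', 'O', 'O']]
y_pred = [['B-PER', 'O',     'O', 'O', 'O']]  # drops the I-PER token

# token accuracy: 4 of 5 tags are correct
correct = sum(t == p for t, p in zip(y_true[0], y_pred[0]))
print(correct / len(y_true[0]))   # 0.8
# entity F1: the one gold entity is missed entirely
print(f1_score(y_true, y_pred))   # 0.0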
Evaluation: Token vs Entity Level
| Metric | What it measures | When to use |
|---|---|---|
| Token accuracy | Per-token tag correctness | Never (dominated by O tokens) |
| Token F1 (micro) | Weighted by token frequency | Rarely |
| Entity F1 (seqeval) | Exact span match | Standard for NER |
| Partial entity credit | Overlap between predicted and gold spans | Information retrieval tasks |
Sequence Labeling Model Comparison
| Model | Context window | Feature learning | Decoding | Best for |
|---|---|---|---|---|
| HMM | Bigram (one previous tag) | No (counts) | Viterbi | Baseline, fast |
| CRF | Full sequence (features of full x) | No (hand features) | Viterbi | Small data, interpretable |
| BiLSTM-CRF | Bidirectional RNN context | Yes | CRF Viterbi | Mid-size datasets |
| BERT + CRF | Full transformer context | Yes (pretrained) | CRF Viterbi | Large datasets, SOTA |
Common Pitfalls
Ignoring the BIO consistency constraint. A classifier applied token-by-token can predict I-ORG after O, which is structurally invalid. The CRF transition matrix (or a simple BIO consistency check post-hoc) prevents this. Without it, entity spans can fragment.
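A minimal sketch of such a post-hoc repair; promoting a stranded I- tag to B- is one standard convention, not the only option:

def repair_bio(tags):
    """Fix structurally invalid BIO output from a per-token classifier."""
    fixed, prev = [], 'O'
    for tag in tags:
        if tag.startswith('I-') and prev not in (f'B-{tag[2:]}', f'I-{tag[2:]}'):
            tag = 'B-' + tag[2:]  # stranded inside tag: start a new entity
        fixed.append(tag)
        prev = tag
    return fixed

print(repair_bio(['O', 'I-ORG', 'I-ORG']))  # ['O', 'B-ORG', 'I-ORG']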
Using token-level accuracy. Report entity F1 (seqeval). Token accuracy is almost always misleading.
Subword tokenization and BERT. BERT tokenizes "California" as one token but "Wolfsberg" might become ["Wolf", "##sberg"]. For NER, only the first subword of each word gets a label — trailing subwords are ignored during evaluation. Use is_split_into_words=True and the word_ids() method from the HuggingFace tokenizer to align predictions back to words.
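A minimal sketch of that alignment step, assuming a fast Hugging Face tokenizer (which provides word_ids()); the label map is illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
words  = ['Wolfsberg', 'is', 'in', 'Austria']
labels = ['B-LOC', 'O', 'O', 'B-LOC']
label2id = {'O': 0, 'B-LOC': 1, 'I-LOC': 2}  # illustrative label map

enc = tokenizer(words, is_split_into_words=True)
aligned, prev_word = [], None
for word_id in enc.word_ids():
    if word_id is None:            # special tokens: [CLS], [SEP]
        aligned.append(-100)       # -100 is ignored by PyTorch cross-entropy
    elif word_id != prev_word:     # first subword carries the word's label
        aligned.append(label2id[labels[word_id]])
    else:                          # trailing subwords like "##sberg"
        aligned.append(-100)
    prev_word = word_id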
Cross-domain generalization. NER models trained on news fail on clinical text, legal documents, or social media. Domain-specific entities (drug names, legal terms, informal abbreviations) require domain-adapted training data or fine-tuning.