
BERT & Encoder Models

BERT (Bidirectional Encoder Representations from Transformers) marked a turning point in NLP: the same pre-trained model could be adapted to dozens of tasks — classification, named-entity recognition, question answering — with minimal task-specific architecture. Understanding BERT means understanding the encoder-only architecture, masked pre-training, and the pre-train/fine-tune paradigm that still underpins much of production NLP today.

Theory

[Interactive figure: two attention heatmaps for the sentence "The movie was not good." (rows = query token, columns = key token; ✕ = masked future token; weights are illustrative). Causal (GPT / Llama): "not" (row 3) cannot attend to "good", which has not appeared yet. Bidirectional (BERT): "not" attends strongly to "good", so the model captures the negation.]

BERT reads the entire sentence at once — every token attends to every other token simultaneously. The diagram above shows the difference: GPT-style causal attention can only look left; BERT's bidirectional attention sees the whole context. This is why BERT can resolve "bank" in "the river bank" versus "the savings bank" — the surrounding words on both sides determine the meaning.

Bidirectional vs. Causal Attention

The Transformer architecture (see the Attention lesson) supports two attention patterns:

Causal (unidirectional) attention — used in GPT-style decoders — masks future tokens. Each position can attend only to itself and positions to its left:

A_{ij} = \begin{cases} \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)_{ij} & j \leq i \\ 0 & j > i \end{cases}

(In practice the mask sets the pre-softmax score to $-\infty$ for $j > i$, which drives those attention weights to zero after the softmax.)

Bidirectional (full) attention — used in BERT-style encoders — allows each token to attend to every other token in the sequence:

A = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)

This distinction is fundamental. A causal model is constrained to produce outputs left-to-right (suitable for generation). A bidirectional encoder builds a richer representation of each token informed by both context directions — better for understanding tasks.
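The two patterns are easy to see in code. Below is a small NumPy sketch (the function name and toy shapes are mine, not from any library): the causal variant zeroes every weight above the diagonal, while each row still sums to 1.

```python
import numpy as np

def attention_weights(Q, K, causal=False):
    """Scaled dot-product attention weights (value projection omitted).

    With causal=True, position i may attend only to j <= i: future
    scores are set to -inf before the softmax, so their weights
    become exactly 0 -- matching the piecewise definition above.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)  # j > i
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))

A_causal = attention_weights(Q, K, causal=True)   # upper triangle is all zeros
A_full = attention_weights(Q, K, causal=False)    # every position sees every other
```

Row i of `A_causal` is a probability distribution over positions 0..i only, while every row of `A_full` spreads mass over the whole sequence.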

💡Intuition

Think of causal attention as reading a book word by word without being able to look ahead. Bidirectional attention is like reading the whole sentence at once before deciding what each word means. For tasks like "is this review positive?" or "what entity is 'Paris' here?", you want full context — hence encoders.

Pre-Training Objectives

BERT is pre-trained on two self-supervised objectives that require no labeled data:

Masked Language Modeling (MLM): 15% of input tokens are selected. Of those:

  • 80% replaced with [MASK]
  • 10% replaced with a random token
  • 10% left unchanged

The model must predict the original token at each masked position. The loss is cross-entropy over the masked positions only:

\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}})

where $\mathcal{M}$ is the set of masked indices and $x_{\setminus \mathcal{M}}$ denotes the unmasked context.

Masked language modeling forces bidirectionality in a way that next-token prediction cannot. To predict a masked token, the model must draw on both left and right context; the mask position has no privileged direction. A left-to-right language model, by contrast, must hide future tokens to keep its objective well-defined, so its attention can never use right context at all. The masking strategy is what makes bidirectional attention trainable.
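The 80/10/10 corruption scheme can be sketched in a few lines. This is a toy version over word strings; real BERT operates on WordPiece token ids, and `mlm_mask` is an illustrative name, not a library function.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "movie", "was", "not", "good", "film", "great"]

def mlm_mask(tokens, mask_prob=0.15, seed=None):
    """BERT-style MLM corruption (toy version on word strings).

    Select ~15% of positions; of those, 80% become [MASK],
    10% become a random vocabulary token, 10% stay unchanged.
    The loss is computed only at the selected positions.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok  # loss is taken only at these positions
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK
            elif r < 0.9:
                corrupted[i] = rng.choice(VOCAB)
            # else: keep the original token (the final 10%)
    return corrupted, targets

corrupted, targets = mlm_mask(["the", "movie", "was", "not", "good"], seed=3)
```

The 10% random / 10% unchanged cases exist because `[MASK]` never appears at fine-tuning time; they force the model to maintain a good representation of every token, not just masked ones.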

Next Sentence Prediction (NSP): given two sentence segments A and B, predict whether B actually follows A in the corpus (50% positive, 50% random). The [CLS] token's final representation is used for this binary classification. Note: later work (RoBERTa) showed NSP adds little benefit and removed it.
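Pair construction for NSP might look like the following sketch. `make_nsp_pairs` is a made-up helper; real BERT draws its negatives from a different document, while this toy version samples from the same list.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build NSP training pairs: 50% true next sentence (label 1),
    50% a random sentence (label 0). Toy version -- real BERT samples
    the negative from a different document in the corpus."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))   # IsNext
        else:
            pairs.append((sentences[i], rng.choice(sentences), 0))  # NotNext
    return pairs

pairs = make_nsp_pairs(["s1", "s2", "s3", "s4", "s5"])
```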

Special Tokens and Input Format

BERT uses WordPiece tokenization with three special tokens:

Input:    [CLS]  The    film   was    great  [SEP]  I      loved  it     [SEP]
Segment:    A      A      A      A      A      A      B      B      B      B
Position:   0      1      2      3      4      5      6      7      8      9

The final embedding for each token is the sum of three learned embeddings:

\mathbf{e}_i = \mathbf{e}_i^{\text{token}} + \mathbf{e}_i^{\text{segment}} + \mathbf{e}_i^{\text{position}}

The [CLS] token at position 0 aggregates sequence-level information — its final hidden state is used as the sentence representation for classification tasks. [SEP] marks segment boundaries.
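The three-way embedding sum can be sketched with toy lookup tables. All sizes and ids below are invented for illustration; BERT-base actually uses hidden size 768, a ~30K WordPiece vocabulary, 2 segments, and 512 positions.

```python
import numpy as np

# Toy sizes; BERT-base: H=768, ~30K vocab, 2 segments, 512 positions.
H, VOCAB_SIZE, N_SEGMENTS, N_POSITIONS = 16, 100, 2, 32
rng = np.random.default_rng(0)
tok_emb = rng.normal(size=(VOCAB_SIZE, H))
seg_emb = rng.normal(size=(N_SEGMENTS, H))
pos_emb = rng.normal(size=(N_POSITIONS, H))

def embed(token_ids, segment_ids):
    """Final input embedding: token + segment + position, summed per slot."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + seg_emb[segment_ids] + pos_emb[positions]

# "[CLS] The film was great [SEP] I loved it [SEP]" with invented ids
token_ids = np.array([1, 7, 12, 9, 30, 2, 15, 22, 8, 2])
segment_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
E = embed(token_ids, segment_ids)  # one (H,) vector per input position
```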

Fine-Tuning

After pre-training on a large corpus, BERT is adapted to downstream tasks by adding a small task-specific head and fine-tuning the entire model end-to-end:

Task                         | Input format                       | Head
-----------------------------|------------------------------------|--------------------------------------
Sentence classification      | [CLS] sentence [SEP]               | Linear on [CLS] output
Token classification (NER)   | [CLS] tokens [SEP]                 | Linear on each token output
Extractive QA                | [CLS] question [SEP] passage [SEP] | Two linears predicting start/end span
Sentence pair classification | [CLS] A [SEP] B [SEP]              | Linear on [CLS] output

Fine-tuning uses a small learning rate (2e-5 to 5e-5) for a few epochs (2–4). The pre-trained weights provide a strong initialization that transfers across tasks.

python
from transformers import BertForSequenceClassification, BertTokenizer
import torch
 
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
 
text = "This movie was absolutely fantastic!"
inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True)
 
# Fine-tuning loop (simplified; in practice, iterate over batches for 2-4 epochs)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
optimizer.zero_grad()
outputs = model(**inputs, labels=torch.tensor([1]))  # label=1 (positive)
outputs.loss.backward()
optimizer.step()
 
# Inference
model.eval()  # disable dropout for deterministic predictions
with torch.no_grad():
    logits = model(**inputs).logits
    pred = logits.argmax(-1).item()   # 0=negative, 1=positive

Walkthrough

Fine-Tuning BERT for Sentiment Classification

Concretely, here is what happens when you adapt BERT-base to a binary sentiment task (positive / negative reviews):

Tokenization. The input "This film was great" becomes:

[CLS]  This  film  was  great  [SEP]
  0      1     2    3     4      5    ← position IDs
  A      A     A    A     A      A    ← segment IDs (all segment A, single sentence)

Forward pass. BERT runs all 12 Transformer layers over all 6 positions simultaneously. Because attention is bidirectional, every token's representation is influenced by every other token from the first layer onward. After 12 layers, the [CLS] representation is a dense vector encoding the whole sentence.

Classification head. A single nn.Linear(768, 2) projects the [CLS] vector to logits for (negative, positive). That's the only new parameter added — roughly 1,500 weights on top of 110M pre-trained ones.
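Written out by hand (a sketch: the random vector merely stands in for BERT's final [CLS] hidden state, and the weight initialization is illustrative):

```python
import numpy as np

hidden, num_labels = 768, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(num_labels, hidden)) * 0.02  # like nn.Linear(768, 2)
b = np.zeros(num_labels)

cls_vector = rng.normal(size=hidden)  # stand-in for the [CLS] hidden state
logits = W @ cls_vector + b           # scores for (negative, positive)

new_params = W.size + b.size          # 768*2 + 2 = 1538 new parameters
```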

Fine-tuning dynamics. Training uses a small learning rate (2e-5) for 2–3 epochs. The pre-trained weights barely move: they already encode syntax, semantics, and world knowledge from pre-training on billions of words. The linear head learns quickly what "positive BERT representation" looks like.

Why it works on small data. BERT fine-tuning achieves competitive accuracy with as few as 1,000 labeled examples for many classification tasks — because the pre-trained representations are already high quality. A model trained from scratch on 1,000 examples would perform far worse.

Analysis & Evaluation

Where Your Intuition Breaks

Misconception: "More pretraining data always makes BERT better." Beyond a point, fine-tuning data matters more than pretraining scale. A BERT pretrained on 10x more data but fine-tuned on only 100 labeled examples will typically underperform a smaller model fine-tuned on 10,000 examples. Pretraining gives BERT general language understanding; the fine-tuning data shapes the task-specific decision boundaries. For low-resource tasks, better fine-tuning data outweighs pretraining scale.

BERT vs. GPT Architecture Comparison

Property             | BERT (encoder-only)                    | GPT (decoder-only)
---------------------|----------------------------------------|---------------------------
Attention            | Bidirectional (full)                   | Causal (masked)
Pre-training         | MLM + NSP                              | Next-token prediction
Strengths            | Classification, extraction, embeddings | Generation, few-shot, chat
Max context (base)   | 512 tokens                             | 512–128K+ tokens
Parameter efficiency | High (shared context)                  | High (scales better)
Fine-tuning          | Supervised (task-specific)             | Prompting or PEFT

Why encoders still matter in production: For tasks where you need to embed text into a fixed vector (semantic search, retrieval, clustering), encoder models produce richer per-token representations than decoder models of the same size because they see full bidirectional context. Many production embedding models (sentence-transformers, BGE, E5) are fine-tuned BERT variants.
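Mean pooling over the attention mask, the typical recipe in sentence-transformers-style embedding models, can be sketched as follows (the helper names and the random matrix standing in for encoder outputs are mine):

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average the token vectors, skipping padding positions.

    The usual pooling step in embedding models: per-token encoder
    outputs -> one fixed-size vector for the whole text."""
    mask = attention_mask[..., None].astype(float)
    return (hidden_states * mask).sum(axis=-2) / mask.sum(axis=-2)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
doc = rng.normal(size=(6, 8))        # stand-in for encoder outputs: 6 tokens, H=8
mask = np.array([1, 1, 1, 1, 0, 0])  # last two positions are padding
emb = mean_pool(doc, mask)           # padding excluded from the average
```

At query time you embed the query the same way and rank documents by `cosine(query_emb, doc_emb)`.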

BERT Variants

The BERT family spawned many improvements. Key ones to know:

Model                      | Key change                                                       | Why it matters
---------------------------|------------------------------------------------------------------|------------------------------------------
RoBERTa (Liu et al., 2019) | Removed NSP; trained longer, on more data; dynamic masking       | Stronger than BERT on most benchmarks
DistilBERT                 | Knowledge distillation to 66% of BERT size                       | 60% faster inference, 97% of performance
ALBERT                     | Cross-layer parameter sharing + factorized embedding             | Much smaller model, near-BERT accuracy
DeBERTa (He et al., 2020)  | Disentangled attention for content + position separately         | SOTA on many NLU benchmarks
ModernBERT (2024)          | Flash attention, longer context (8K), rotary position embeddings | BERT updated to 2024 training practices

For new projects: prefer ModernBERT or DeBERTa-v3 over original BERT. For embedding tasks: prefer a domain-tuned model from the MTEB leaderboard.

Encoder vs. Decoder vs. Encoder-Decoder

The modern landscape has three architecture families:

Encoder-only (BERT, RoBERTa, DeBERTa): Best for tasks that consume text and produce a label or vector. Cannot generate text.

Decoder-only (GPT, Llama, Claude): Best for generation, instruction-following, reasoning. Can also embed (using last token) but with less efficiency than encoders.

Encoder-Decoder (T5, BART, mT5): Encoder processes input, decoder generates output. Best for sequence-to-sequence tasks: translation, summarization, abstractive QA. The encoder produces rich bidirectional representations; the decoder uses cross-attention to condition generation on them.
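Cross-attention is the same softmax machinery with queries and keys drawn from different stacks; a NumPy sketch with illustrative shapes:

```python
import numpy as np

def cross_attention(dec_Q, enc_K, enc_V):
    """Cross-attention sketch: decoder queries score ENCODER keys,
    then mix ENCODER values, so each generated step is conditioned
    on the full bidirectional encoding of the input."""
    d_k = dec_Q.shape[-1]
    scores = dec_Q @ enc_K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ enc_V

rng = np.random.default_rng(0)
dec_Q = rng.normal(size=(3, 8))  # 3 decoder positions
enc_K = rng.normal(size=(5, 8))  # 5 encoder positions
enc_V = rng.normal(size=(5, 8))
out = cross_attention(dec_Q, enc_K, enc_V)  # one mixed vector per decoder step
```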

The industry trend: decoder-only models have grown dominant for general tasks (zero-shot, few-shot, instruction-following) due to simpler training, better scaling, and stronger emergent capabilities. Encoder-only models remain competitive for latency-sensitive classification and retrieval.

🚀Production

When to use an encoder model:

  • Semantic search / dense retrieval — encode queries and documents into vectors, compare with cosine similarity
  • Text classification — sentiment, topic, intent — fast and accurate with fine-tuned BERT
  • Named entity recognition — reliable token-level predictions
  • Extractive QA — span prediction in a known passage (not open-domain)

When NOT to use an encoder model:

  • You need to generate text — use a decoder
  • Context exceeds 512 tokens — original BERT truncates; use ModernBERT (8K) or a decoder
  • Zero-shot task — encoders need fine-tuning; decoders can prompt

Practical setup (HuggingFace):

  • Classification: AutoModelForSequenceClassification
  • Embeddings: sentence-transformers library with SentenceTransformer('all-MiniLM-L6-v2')
  • Fine-tuning: use Trainer with learning_rate=2e-5, num_train_epochs=3, weight_decay=0.01
  • Batch size 32 fits comfortably on a single GPU for BERT-base (110M params)

Embedding model selection: check the MTEB leaderboard for your specific task (retrieval, classification, clustering). A specialized embedding model at 110M params often beats a 7B decoder for retrieval tasks.
