
Seq2Seq & T5

Not every NLP task is about labeling or embedding a fixed input. Translation, summarization, and question answering all require generating a variable-length output from a variable-length input. This is the sequence-to-sequence (seq2seq) problem. T5 (Text-to-Text Transfer Transformer) unified this family of tasks under a single architecture and training objective, and its design choices — cross-attention, span corruption, task prefixes — remain the blueprint for modern encoder-decoder models.

Theory

[Interactive diagram: cross-attention in translation. Encoder (source): "The cat sat on the mat." Decoder (target): "Le chat était assis sur le tapis." Q comes from the decoder state and changes each step; K and V come from the fixed encoder output. When generating "chat", attention falls most heavily on "cat" (70%). Weights are illustrative; "The" → "Le" and "the" → "le" both attend to articles.]

A seq2seq model has two parts: an encoder that reads the full input and compresses it into a set of representations, and a decoder that generates output one token at a time. Cross-attention — shown in the diagram — is how the decoder stays grounded in the source: at every decoding step, it looks back at all encoder positions and decides which parts of the input are most relevant right now. Without cross-attention, the decoder would have to carry the entire source meaning in a single fixed-size vector.

The Encoder-Decoder Architecture

An encoder-decoder Transformer has two sub-networks:

Encoder: processes the full input sequence with bidirectional self-attention, producing a set of contextualized representations $\mathbf{H} = (h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times d}$.

Decoder: generates the output sequence one token at a time. At step $t$, the decoder has access to:

  1. Previously generated tokens $y_1, \ldots, y_{t-1}$ (via causal self-attention)
  2. The full encoder output $\mathbf{H}$ (via cross-attention)

Cross-Attention

Cross-attention is the mechanism that lets the decoder "read" the encoder output. At each decoder layer, queries come from the decoder, but keys and values come from the encoder:

$$\text{CrossAttn}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) = \text{softmax}\!\left(\frac{Q_{\text{dec}} K_{\text{enc}}^\top}{\sqrt{d_k}}\right) V_{\text{enc}}$$

where $Q_{\text{dec}} = H_{\text{dec}} W^Q$, $K_{\text{enc}} = \mathbf{H} W^K$, and $V_{\text{enc}} = \mathbf{H} W^V$.

This asymmetry is what distinguishes an encoder-decoder from a pure decoder: instead of computing Q, K, and V from the same sequence, the queries originate in the decoder's current state while the keys and values are fixed projections of the encoder output.
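Cross-attention is small enough to sketch directly. Below is a minimal single-head version in NumPy; the dimensions and random matrices are toy assumptions for illustration, not T5's real sizes:

```python
import numpy as np

def cross_attention(H_dec, H_enc, W_q, W_k, W_v):
    """Single-head cross-attention: queries from the decoder,
    keys/values from the (fixed) encoder output."""
    Q = H_dec @ W_q                            # (t, d) — one query per decoder position
    K = H_enc @ W_k                            # (n, d) — projections of encoder output
    V = H_enc @ W_v                            # (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (t, n) — decoder-to-source scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over source positions
    return weights @ V                         # (t, d) — source-informed context vectors

rng = np.random.default_rng(0)
d = 8
H_enc = rng.normal(size=(6, d))  # 6 source tokens, encoded once
H_dec = rng.normal(size=(3, d))  # decoder states after 3 generated tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out = cross_attention(H_dec, H_enc, W_q, W_k, W_v)
print(out.shape)  # (3, 8): one context vector per decoder position
```

Only `H_dec` changes as decoding proceeds; `K` and `V` depend solely on the encoder output and could be computed once and cached.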

💡Intuition

In translation, the decoder is asking: "given what I've generated so far, which source words are most relevant to generate the next target word?" The cross-attention scores answer that question. For example, when generating "chat" in French, the decoder should attend strongly to "cat" in the English source.

Full Forward Pass (Decoder at Step $t$)

The decoder stack applies three sublayers at each layer:

  1. Causal self-attention over $(y_1, \ldots, y_{t-1})$ — decoder tokens can only attend to earlier decoder tokens
  2. Cross-attention with $Q$ from step 1's output and $K, V$ from the encoder output $\mathbf{H}$
  3. Feed-forward network on the result

The final decoder hidden state at step tt is projected to vocabulary logits, then sampled or greedily decoded.
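That final projection-and-pick step can be sketched as a toy greedy loop. Here `step_logits_fn` stands in for the full decoder forward pass, and `toy_model` is an invented stand-in, not a real model:

```python
import numpy as np

def greedy_decode(step_logits_fn, max_new_tokens, eos_id):
    """Greedy decoding sketch: repeatedly pick the argmax token until EOS."""
    prefix = []
    for _ in range(max_new_tokens):
        logits = step_logits_fn(prefix)  # vocabulary logits for the next token
        token = int(np.argmax(logits))   # greedy: take the single best token
        prefix.append(token)
        if token == eos_id:
            break
    return prefix

def toy_model(prefix):
    """Stand-in 'decoder': prefers token len(prefix)+1, then EOS (id 0)."""
    logits = np.zeros(5)
    nxt = len(prefix) + 1 if len(prefix) < 3 else 0
    logits[nxt] = 1.0
    return logits

print(greedy_decode(toy_model, max_new_tokens=10, eos_id=0))  # [1, 2, 3, 0]
```

Sampling replaces the `argmax` with a draw from the softmax distribution; beam search keeps several candidate prefixes instead of one.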

T5: Text-to-Text Transfer Transformer

T5 (Raffel et al., 2020) made a key unifying bet: every NLP task can be framed as text-in, text-out. Rather than designing task-specific heads, T5 prepends a natural language task prefix to the input:

| Task | Input to T5 | Expected output |
| --- | --- | --- |
| Translation | translate English to French: The cat sat. | Le chat était assis. |
| Summarization | summarize: [long article...] | [short summary] |
| Sentiment | sentiment: I loved this film. | positive |
| QA | question: Who wrote Hamlet? context: Shakespeare... | Shakespeare |

This framing means T5 is fine-tuned for new tasks simply by changing the prefix format — no architecture modification needed.

Span Corruption (T5's Pre-Training Objective)

BERT masks individual tokens (MLM). T5 masks contiguous spans of tokens and replaces each span with a single sentinel token. The model must predict the original spans:

Input:  The cat ⟨X⟩ on ⟨Y⟩ mat.
Target: ⟨X⟩ sat ⟨Y⟩ the ⟨EOS⟩

Formally, spans are sampled with mean length $\mu = 3$ tokens, covering 15% of input tokens. The target is the concatenation of all original spans (with their sentinels as delimiters).

Advantages over MLM:

  • The decoder learns to produce contiguous text, not just single tokens — better pretraining for generation tasks
  • Shorter targets reduce compute: the decoder predicts only the ~15% of tokens that were masked, not the full input sequence
  • Multiple spans per example create richer training signal
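As a concrete sketch, the example above can be reproduced with T5's actual sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...). Here the span positions are passed in explicitly so the output is deterministic; real pre-training samples them randomly (mean length 3, ~15% coverage):

```python
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # T5's sentinel vocabulary

def span_corrupt(tokens, spans):
    """Replace each (start, length) span with one sentinel.
    The target is each sentinel followed by the text it replaced."""
    inp, tgt, i = [], [], 0
    for s, (start, length) in enumerate(spans):
        inp.extend(tokens[i:start])              # keep unmasked tokens
        inp.append(SENTINELS[s])                 # one sentinel per span
        tgt.append(SENTINELS[s])
        tgt.extend(tokens[start:start + length]) # span content goes to the target
        i = start + length
    inp.extend(tokens[i:])
    tgt.append("</s>")
    return inp, tgt

tokens = "The cat sat on the mat .".split()
inp, tgt = span_corrupt(tokens, [(2, 1), (4, 1)])  # mask "sat" and "the"
print(" ".join(inp))  # The cat <extra_id_0> on <extra_id_1> mat .
print(" ".join(tgt))  # <extra_id_0> sat <extra_id_1> the </s>
```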
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

# T5 text-to-text: task prefix + input
input_text = (
    "summarize: The Eiffel Tower was built between 1887 and 1889. "
    "It stands 330 meters tall and is the most-visited monument in the world."
)

inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=60)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
# → "The Eiffel Tower was built between 1887 and 1889 and stands 330 meters tall."
```

Walkthrough

Tracing a Summarization Request Through T5

Input (after tokenization):

summarize: Central banks raised interest rates sharply in 2022
to combat inflation. The Federal Reserve increased rates seven
times, bringing the federal funds rate from near zero to over 4%.

Step 1 — Encoder pass. All tokens processed in parallel with full bidirectional self-attention. Output: one 768-dim vector per input token, each informed by the entire input context.

Step 2 — Decoder generates token by token, starting from just <pad> (T5's start token):

  • Token 1: cross-attends to the encoder output → strong attention on "Central banks", "raised", "rates" → generates "Central"
  • Token 2: cross-attends to the same (unchanged) encoder output → generates "banks"
  • Token 3: attention shifts toward "2022", "inflation" → generates "raised"
  • ... continues until <EOS> is sampled

Step 3 — Output: "Central banks raised interest rates seven times in 2022 to combat inflation."

Key insight: the encoder output is computed once and never changes during decoding. Every decoder step reads the same $\mathbf{H}$, but uses different decoder-side queries (because the partially decoded output grows). This is the extra cost of encoder-decoder decoding: the encoder runs once, but every decode step performs cross-attention over the full encoder output in addition to self-attention over the tokens generated so far.

Training vs. inference — teacher forcing. During training, the decoder does not consume its own predicted tokens as inputs. Instead it receives the true previous token at every step: step $t$ is conditioned on $y_{t-1}^*$ (the ground-truth token), not $\hat{y}_{t-1}$ (the model's prediction). This technique is called teacher forcing.

Teacher forcing is necessary to make training tractable. Without it, an error at decoding step 3 feeds into step 4, compounds into step 5, and within a few steps the decoder is in a state that the network has never seen during training — gradients become meaningless. Teacher forcing keeps every step conditioned on the true previous token, ensuring clean gradient estimates throughout the sequence. The downside — exposure bias, where the model never sees its own errors during training — is a real problem at inference time, addressed by techniques like scheduled sampling.
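The input/label shift is easy to see concretely. A toy sketch (T5 really does use <pad> as its decoder start token):

```python
# Teacher forcing: the decoder input is the ground-truth target shifted right.
target = ["Le", "chat", "était", "assis", "."]

decoder_input = ["<pad>"] + target[:-1]  # tokens the decoder consumes
labels = target                          # tokens it must predict

for step, (inp, lab) in enumerate(zip(decoder_input, labels), start=1):
    print(f"step {step}: consume {inp!r} → predict {lab!r}")
```

Because every decoder input is known in advance, all steps of the training loss can be computed in one parallel forward pass; only inference must proceed token by token.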

Analysis & Evaluation

Where Your Intuition Breaks

The encoder compresses the input into a representation that captures everything important. The original 2014 seq2seq architecture used a single fixed-size bottleneck vector — and it worked surprisingly well on short sequences. For long inputs, it doesn't: a single vector cannot hold the information from a 500-word document without significant loss. This is precisely why Bahdanau attention, a year later, replaced the bottleneck with attention over all encoder states. The bottleneck assumption is wrong; the fix is now universal.

Enc-Decoder vs. Decoder-Only for Generation Tasks

The enc-dec architecture was dominant until ~2022. Since then, decoder-only models have largely taken over, even for generation tasks. Why?

| Property | Encoder-Decoder (T5, BART) | Decoder-Only (GPT, Llama, Claude) |
| --- | --- | --- |
| Architecture | Encoder + cross-attention + decoder | Decoder only |
| Input representation | Rich bidirectional (encoder) | Causal (left-to-right only) |
| Pre-training | Span corruption / denoising | Next-token prediction |
| Scaling behavior | Good at moderate size | Scales more efficiently to very large |
| Few-shot ability | Weak without fine-tuning | Strong (in-context learning) |
| Seq2seq tasks | Strong (native) | Competitive at scale |
| Parameters per FLOP | Efficient (encoder shared) | Less efficient at small scale |

Why enc-dec still wins in specific cases:

  • Strict input-output boundary matters (you never want input tokens in the output): translation, structured extraction
  • Context much longer than output: summarization of very long documents
  • Latency budget for encoder is OK: you can run encoder once and cache for many decoding calls

Why decoder-only has taken over:

  • In-context learning (few-shot prompting) works without fine-tuning
  • Simpler KV cache infrastructure — only one model to optimize
  • Better emergent reasoning at scale (>13B params)

T5 Variants

| Model | Key change | When to use |
| --- | --- | --- |
| T5-base / T5-large | Original 2020 model | Fine-tuning baselines |
| T5-v1.1 | No dropout during pre-training; better init | Preferred over original T5 |
| mT5 | Multilingual (101 languages) | Cross-lingual tasks |
| FLAN-T5 | Instruction-tuned on 1,836 tasks | Zero/few-shot via prompting; best default choice |
| FLAN-UL2 | Mixture-of-Denoisers pre-training | Long-context tasks |

For most new projects requiring seq2seq: start with FLAN-T5-large (780M params). It handles prompting without fine-tuning, and fine-tunes well when you have labeled data.

🚀Production

When to reach for T5 / enc-dec:

  • Summarization, translation, data-to-text, structured extraction — tasks where input and output are clearly separated
  • You have labeled (input, output) pairs and budget to fine-tune — enc-dec fine-tuning is highly sample-efficient
  • Latency matters and inputs are long: encode once, decode fast

When to skip it:

  • You need strong zero-shot generalization without fine-tuning — prefer a decoder-only model
  • Tasks require reasoning over very long context (32K+) — enc-dec context windows are smaller
  • You're building a chatbot or instruction-following assistant — decoder-only is the natural fit

Practical tips:

  • Task prefix format matters: "summarize: " and "tldr: " produce different outputs from the same model
  • Max decoder length (max_new_tokens) must be set explicitly — T5 won't stop unless <EOS> is generated or the limit is hit
  • Use num_beams=4 for quality-sensitive tasks (translation, summarization); greedy for speed-sensitive tasks
  • FLAN-T5 checkpoint sizes: small (80M), base (250M), large (780M), xl (3B), xxl (11B)
