Seq2Seq & T5
Not every NLP task is about labeling or embedding a fixed input. Translation, summarization, and question answering all require generating a variable-length output from a variable-length input. This is the sequence-to-sequence (seq2seq) problem. T5 (Text-to-Text Transfer Transformer) unified this family of tasks under a single architecture and training objective, and its design choices — cross-attention, span corruption, task prefixes — remain the blueprint for modern encoder-decoder models.
Theory
(Diagram: cross-attention during English→French translation. Weights are illustrative; "The" → "Le" and "the" → "le" both attend to articles.)
A seq2seq model has two parts: an encoder that reads the full input and compresses it into a set of representations, and a decoder that generates output one token at a time. Cross-attention — shown in the diagram — is how the decoder stays grounded in the source: at every decoding step, it looks back at all encoder positions and decides which parts of the input are most relevant right now. Without cross-attention, the decoder would have to carry the entire source meaning in a single fixed-size vector.
The Encoder-Decoder Architecture
An encoder-decoder Transformer has two sub-networks:
Encoder: processes the full input sequence with bidirectional self-attention, producing a set of contextualized representations $h_1, \dots, h_n$, one per input token.
Decoder: generates the output sequence one token at a time. At step $t$, the decoder has access to:
- Previously generated tokens (via causal self-attention)
- The full encoder output (via cross-attention)
Cross-Attention
Cross-attention is the mechanism that lets the decoder "read" the encoder output. At each decoder layer, queries come from the decoder, but keys and values come from the encoder:
$$\text{CrossAttn}(H_{\text{dec}}, H_{\text{enc}}) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where $Q = H_{\text{dec}}W_Q$ and $K = H_{\text{enc}}W_K$, $V = H_{\text{enc}}W_V$.
This asymmetry is what makes enc-dec different from a pure decoder: instead of computing Q, K, V all from the same sequence, the Q originates in the decoder's current state while K and V are fixed projections of the encoder output.
In translation, the decoder is asking: "given what I've generated so far, which source words are most relevant to generate the next target word?" The cross-attention scores answer that question. For example, when generating "chat" in French, the decoder should attend strongly to "cat" in the English source.
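A minimal single-head sketch (NumPy, illustrative dimensions only) makes the asymmetry concrete: queries are projected from decoder states, while keys and values are projections of the encoder output.

```python
import numpy as np

def cross_attention(h_dec, h_enc, W_q, W_k, W_v):
    """Single-head cross-attention: queries from the decoder,
    keys/values from the (fixed) encoder output."""
    Q = h_dec @ W_q  # (t, d_k)  decoder states generated so far
    K = h_enc @ W_k  # (n, d_k)  encoder output, computed once
    V = h_enc @ W_v  # (n, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (t, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)        # softmax over source positions
    return weights @ V, weights                      # context vectors + attention map

rng = np.random.default_rng(0)
d = 16
h_enc = rng.normal(size=(5, d))   # 5 source tokens
h_dec = rng.normal(size=(2, d))   # 2 target tokens generated so far
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
ctx, attn = cross_attention(h_dec, h_enc, W_q, W_k, W_v)
# each row of `attn` sums to 1: every decoder position distributes
# its attention across all 5 source positions
```

Note that `h_enc` never changes: only the decoder side grows as tokens are generated.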
Full Forward Pass (Decoder at Step $t$)
The decoder stack applies three sublayers at each layer:
- Causal self-attention over $y_{<t}$ — decoder tokens can only attend to earlier decoder tokens
- Cross-attention with $Q$ from the step-1 output and $K, V$ from the encoder output
- Feed-forward network on the result
The final decoder hidden state at step $t$ is projected to vocabulary logits, then sampled or greedily decoded.
T5: Text-to-Text Transfer Transformer
T5 (Raffel et al., 2020) made a key unifying bet: every NLP task can be framed as text-in, text-out. Rather than designing task-specific heads, T5 prepends a natural language task prefix to the input:
| Task | Input to T5 | Expected output |
|---|---|---|
| Translation | translate English to French: The cat sat. | Le chat était assis. |
| Summarization | summarize: [long article...] | [short summary] |
| Sentiment | sentiment: I loved this film. | positive |
| QA | question: Who wrote Hamlet? context: Shakespeare... | Shakespeare |
This framing means T5 is fine-tuned for new tasks simply by changing the prefix format — no architecture modification needed.
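In practice, the reframing is just string formatting. The helper below is a hypothetical sketch (not part of any library); the prefixes follow the table above.

```python
def to_text2text(task, **fields):
    """Hypothetical helper: turn a labeled example from any task into
    T5's text-in/text-out format via a task prefix."""
    if task == "translation":
        return f"translate English to French: {fields['text']}", fields["target"]
    if task == "sentiment":
        return f"sentiment: {fields['text']}", fields["label"]
    if task == "qa":
        src = f"question: {fields['question']} context: {fields['context']}"
        return src, fields["answer"]
    raise ValueError(f"unknown task: {task}")

src, tgt = to_text2text("sentiment", text="I loved this film.", label="positive")
# → ("sentiment: I loved this film.", "positive")
```

Every task reduces to the same (source string, target string) interface, so one model and one loss function cover all of them.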
Span Corruption (T5's Pre-Training Objective)
BERT masks individual tokens (MLM). T5 masks contiguous spans of tokens and replaces each span with a single sentinel token. The model must predict the original spans:
Input: The cat ⟨X⟩ on ⟨Y⟩ mat.
Target: ⟨X⟩ sat ⟨Y⟩ the ⟨EOS⟩
Formally, spans are sampled with a mean length of 3 tokens, covering 15% of input tokens. The target is the concatenation of all original spans (with their sentinels as delimiters).
Advantages over MLM:
- The decoder learns to produce contiguous text, not just single tokens — better pretraining for generation tasks
- Shorter targets reduce compute: predicting 3-token spans instead of masked positions
- Multiple spans per example create richer training signal
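A simplified implementation of the corruption step, assuming the spans are already chosen (real T5 samples them randomly); the `<extra_id_N>` sentinel names match T5's actual reserved tokens.

```python
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # T5's reserved sentinel tokens

def span_corrupt(tokens, spans):
    """Replace each (start, length) span with one sentinel; the target is
    the sentinels interleaved with the original spans. `spans` must be
    non-overlapping and sorted by start position."""
    inp, tgt, last = [], [], 0
    for i, (start, length) in enumerate(spans):
        inp += tokens[last:start] + [SENTINELS[i]]
        tgt += [SENTINELS[i]] + tokens[start:start + length]
        last = start + length
    inp += tokens[last:]
    tgt += ["<eos>"]
    return inp, tgt

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
inp, tgt = span_corrupt(tokens, spans=[(2, 1), (4, 1)])
# inp: ['The', 'cat', '<extra_id_0>', 'on', '<extra_id_1>', 'mat', '.']
# tgt: ['<extra_id_0>', 'sat', '<extra_id_1>', 'the', '<eos>']
```

This reproduces the ⟨X⟩/⟨Y⟩ example above: the input shrinks, and the target contains only the corrupted material.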
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

# T5 text-to-text: task prefix + input
input_text = (
    "summarize: The Eiffel Tower was built between 1887 and 1889. "
    "It stands 330 meters tall and is the most-visited monument in the world."
)
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=60)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
# → "The Eiffel Tower was built between 1887 and 1889 and stands 330 meters tall."
```
Walkthrough
Tracing a Summarization Request Through T5
Input (after tokenization):
summarize: Central banks raised interest rates sharply in 2022
to combat inflation. The Federal Reserve increased rates seven
times, bringing the federal funds rate from near zero to over 4%.
Step 1 — Encoder pass. All tokens processed in parallel with full bidirectional self-attention. Output: one 768-dim vector per input token, each informed by the entire input context.
Step 2 — Decoder generates token by token. Decoder starts with just <pad> (the start token):
- Step 1: cross-attends to encoder output → strong attention on "Central banks", "raised", "rates" → generates "Central"
- Step 2: cross-attends again (still same encoder output, unchanged) → generates "banks"
- Step 3: attention shifts toward "2022", "inflation" → generates "raised"
- ... continues until <EOS> is sampled
Step 3 — Output: "Central banks raised interest rates seven times in 2022 to combat inflation."
Key insight: the encoder output is computed once and never changes during decoding. Every decoder step reads the same $H_{\text{enc}}$, but uses different decoder-side queries (because the partially-decoded output grows). This is also the extra per-step cost relative to decoder-only models: each decode step performs cross-attention over the full encoder output on top of its causal self-attention.
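The decode loop can be sketched as follows; `encoder` and `decoder_step` are stand-in callables, not a real model.

```python
PAD_ID, EOS_ID = 0, 1

def generate(encoder, decoder_step, input_ids, max_len=10):
    """Sketch of enc-dec inference: the encoder runs once; every decode
    step reuses the same frozen encoder output via cross-attention."""
    h_enc = encoder(input_ids)               # computed once, never updated
    out = [PAD_ID]                           # T5 starts decoding from <pad>
    while len(out) < max_len:
        next_id = decoder_step(out, h_enc)   # queries from `out`, K/V from h_enc
        out.append(next_id)
        if next_id == EOS_ID:
            break
    return out

# Stub model: echoes the source then stops (stands in for a real decoder)
echo = lambda out, h_enc: h_enc[len(out) - 1] if len(out) <= len(h_enc) else EOS_ID
print(generate(lambda ids: ids, echo, [5, 6, 7]))  # [0, 5, 6, 7, 1]
```

Only `out` changes between iterations; `h_enc` is a loop invariant, which is what makes caching it across decode steps free.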
Training vs. inference — teacher forcing. During training, the decoder does not consume its own predicted tokens as inputs. Instead it receives the true previous token at every step: step $t$ is conditioned on $y_{t-1}$ (the ground-truth token), not $\hat{y}_{t-1}$ (the model's prediction). This technique is called teacher forcing.
Teacher forcing is necessary to make training tractable. Without it, an error at decoding step 3 feeds into step 4, compounds into step 5, and within a few steps the decoder is in a state that the network has never seen during training — gradients become meaningless. Teacher forcing keeps every step conditioned on the true previous token, ensuring clean gradient estimates throughout the sequence. The downside — exposure bias, where the model never sees its own errors during training — is a real problem at inference time, addressed by techniques like scheduled sampling.
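In code, teacher forcing is just a right-shift of the target sequence (Hugging Face's T5 performs this shift internally when you pass `labels=`). A minimal sketch with made-up token ids:

```python
def shift_right(labels, start_id=0):
    """Teacher forcing: the decoder input at step t is the *true* token t-1.
    T5 uses the pad token id (0) as the decoder start token."""
    return [start_id] + labels[:-1]

labels        = [1820, 4125, 7, 1]   # target ids, ending in </s>
decoder_input = shift_right(labels)
# decoder_input: [0, 1820, 4125, 7]
# At step t the decoder sees labels[t-1] as input, and the loss compares
# its prediction at step t against labels[t].
```

All positions can therefore be trained in one parallel forward pass, since every decoder input is known in advance.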
Analysis & Evaluation
Where Your Intuition Breaks
The intuition: the encoder compresses the input into a representation that captures everything important. The original 2014 seq2seq architecture took this literally, using a single fixed-size bottleneck vector — and it worked surprisingly well on short sequences. For long inputs, it doesn't: a single vector cannot hold the information from a 500-word document without significant loss. This is precisely why Bahdanau attention (introduced a year later) replaced the bottleneck with attention over all encoder states. The bottleneck assumption is wrong; the fix is now universal.
Enc-Decoder vs. Decoder-Only for Generation Tasks
The enc-dec architecture was dominant until ~2022. Since then, decoder-only models have largely taken over, even for generation tasks. Why?
| Property | Encoder-Decoder (T5, BART) | Decoder-Only (GPT, Llama, Claude) |
|---|---|---|
| Architecture | Encoder + cross-attention + decoder | Decoder only |
| Input representation | Rich bidirectional (encoder) | Causal (left-to-right only) |
| Pre-training | Span corruption / denoising | Next-token prediction |
| Scaling behavior | Good at moderate size | Scales more efficiently to very large |
| Few-shot ability | Weak without fine-tuning | Strong (in-context learning) |
| Seq2seq tasks | Strong (native) | Competitive at scale |
| Parameters per FLOPs | Efficient (encoder shared) | Less efficient at small scale |
Why enc-dec still wins in specific cases:
- Strict input-output boundary matters (you never want input tokens in the output): translation, structured extraction
- Context much longer than output: summarization of very long documents
- Latency budget for encoder is OK: you can run encoder once and cache for many decoding calls
Why decoder-only has taken over:
- In-context learning (few-shot prompting) works without fine-tuning
- Simpler KV cache infrastructure — only one model to optimize
- Better emergent reasoning at scale (>13B params)
T5 Variants
| Model | Key change | When to use |
|---|---|---|
| T5-base / T5-large | Original 2020 model | Fine-tuning baselines |
| T5-v1.1 | No dropout on pre-training; better init | Preferred over original T5 |
| mT5 | Multilingual (101 languages) | Cross-lingual tasks |
| FLAN-T5 | Instruction-tuned on 1,836 tasks | Zero/few-shot via prompting, best default choice |
| FLAN-UL2 | Mixture-of-Denoisers pre-training | Long-context tasks |
For most new projects requiring seq2seq: start with FLAN-T5-large (780M params). It handles prompting without fine-tuning, and fine-tunes well when you have labeled data.
When to reach for T5 / enc-dec:
- Summarization, translation, data-to-text, structured extraction — tasks where input and output are clearly separated
- You have labeled (input, output) pairs and budget to fine-tune — enc-dec fine-tuning is highly sample-efficient
- Latency matters and inputs are long: encode once, decode fast
When to skip it:
- You need strong zero-shot generalization without fine-tuning — prefer a decoder-only model
- Tasks require reasoning over very long context (32K+) — enc-dec context windows are smaller
- You're building a chatbot or instruction-following assistant — decoder-only is the natural fit
Practical tips:
- Task prefix format matters: `"summarize: "` and `"tldr: "` produce different outputs from the same model
- Max decoder length (`max_new_tokens`) must be set explicitly — T5 won't stop unless `<EOS>` is generated or the limit is hit
- Use `num_beams=4` for quality-sensitive tasks (translation, summarization); greedy for speed-sensitive tasks
- FLAN-T5 checkpoint sizes: small (80M), base (250M), large (780M), xl (3B), xxl (11B)
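To see why `num_beams` helps, here is a toy beam search over a hand-made next-token distribution (not T5 itself): greedy decoding commits to the locally best first token and misses the globally most probable sequence.

```python
import math

# Toy next-token model P(next | last token). From "A" the best single step
# is "B" (0.6), but the best *full* path goes through "C".
P = {
    "A": {"B": 0.6, "C": 0.4},
    "B": {"x": 0.3, "y": 0.3, "z": 0.4},
    "C": {"x": 0.9, "y": 0.1},
}

def beam_search(start, steps, k):
    beams = [([start], 0.0)]                     # (tokens, cumulative log-prob)
    for _ in range(steps):
        cand = [(seq + [t], lp + math.log(p))
                for seq, lp in beams
                for t, p in P.get(seq[-1], {}).items()]
        beams = sorted(cand, key=lambda c: -c[1])[:k]  # keep top-k hypotheses
    return beams[0][0]

greedy = beam_search("A", 2, k=1)  # ['A', 'B', 'z']  p = 0.6 * 0.4 = 0.24
beam   = beam_search("A", 2, k=4)  # ['A', 'C', 'x']  p = 0.4 * 0.9 = 0.36
```

Greedy is the `k=1` special case; a wider beam trades compute for sequences with higher total probability, which is why it pays off on translation and summarization.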