
Seq2Seq & T5

Not every NLP task is about labeling or embedding a fixed input. Translation, summarization, and question answering all require generating a variable-length output from a variable-length input. This is the sequence-to-sequence (seq2seq) problem. T5 (Text-to-Text Transfer Transformer) unified this family of tasks under a single architecture and training objective, and its design choices — cross-attention, span corruption, task prefixes — remain the blueprint for modern encoder-decoder models.

Theory

[Interactive diagram: cross-attention in translation. Encoder (source): "The cat sat on the mat." Decoder (target): "Le chat était assis sur le tapis." Q comes from the decoder state and changes each step; K and V come from the fixed encoder output. When generating "chat", attention falls most heavily on "cat" (70%). Weights are illustrative; "The" → "Le" and "the" → "le" both attend to articles.]

A seq2seq model has two parts: an encoder that reads the full input and compresses it into a set of representations, and a decoder that generates output one token at a time. Cross-attention — shown in the diagram — is how the decoder stays grounded in the source: at every decoding step, it looks back at all encoder positions and decides which parts of the input are most relevant right now. Without cross-attention, the decoder would have to carry the entire source meaning in a single fixed-size vector.

The Encoder-Decoder Architecture

An encoder-decoder Transformer has two sub-networks:

Encoder: processes the full input sequence with bidirectional self-attention, producing a set of contextualized representations $\mathbf{H} = (h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times d}$.

Decoder: generates the output sequence one token at a time. At step $t$, the decoder has access to:

  1. Previously generated tokens $y_1, \ldots, y_{t-1}$ (via causal self-attention)
  2. The full encoder output $\mathbf{H}$ (via cross-attention)

Cross-Attention

Cross-attention is the mechanism that lets the decoder "read" the encoder output. At each decoder layer, queries come from the decoder, but keys and values come from the encoder:

$$\text{CrossAttn}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) = \text{softmax}\!\left(\frac{Q_{\text{dec}} K_{\text{enc}}^\top}{\sqrt{d_k}}\right) V_{\text{enc}}$$

where $Q_{\text{dec}} = H_{\text{dec}} W^Q$, $K_{\text{enc}} = \mathbf{H} W^K$, and $V_{\text{enc}} = \mathbf{H} W^V$.

This asymmetry is what distinguishes an encoder-decoder from a pure decoder: instead of computing Q, K, and V from the same sequence, the queries originate in the decoder's current state while the keys and values are fixed projections of the encoder output.
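Cross-attention is small enough to sketch directly. Below is a minimal single-head version in NumPy; the dimensions and random matrices are toy assumptions for illustration, not T5's real sizes:

```python
import numpy as np

def cross_attention(H_dec, H_enc, W_q, W_k, W_v):
    """Single-head cross-attention: queries from the decoder,
    keys/values from the (fixed) encoder output."""
    Q = H_dec @ W_q                            # (t, d) — one query per decoder position
    K = H_enc @ W_k                            # (n, d) — projections of encoder output
    V = H_enc @ W_v                            # (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (t, n) — decoder-to-source scores
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over source positions
    return weights @ V                         # (t, d) — source-informed context vectors

rng = np.random.default_rng(0)
d = 8
H_enc = rng.normal(size=(6, d))  # 6 source tokens, encoded once
H_dec = rng.normal(size=(3, d))  # decoder states after 3 generated tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out = cross_attention(H_dec, H_enc, W_q, W_k, W_v)
print(out.shape)  # (3, 8): one context vector per decoder position
```

Only `H_dec` changes as decoding proceeds; `K` and `V` depend solely on the encoder output and could be computed once and cached.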

💡Intuition

In translation, the decoder is asking: "given what I've generated so far, which source words are most relevant to generate the next target word?" The cross-attention scores answer that question. For example, when generating "chat" in French, the decoder should attend strongly to "cat" in the English source.

Full Forward Pass (Decoder at Step $t$)

The decoder stack applies three sublayers at each layer:

  1. Causal self-attention over $(y_1, \ldots, y_{t-1})$ — decoder tokens can only attend to earlier decoder tokens
  2. Cross-attention with $Q$ from step 1's output and $K, V$ from the encoder output $\mathbf{H}$
  3. Feed-forward network on the result

The final decoder hidden state at step tt is projected to vocabulary logits, then sampled or greedily decoded.
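That final projection-and-pick step can be sketched as a toy greedy loop. Here `step_logits_fn` stands in for the full decoder forward pass, and `toy_model` is an invented stand-in, not a real model:

```python
import numpy as np

def greedy_decode(step_logits_fn, max_new_tokens, eos_id):
    """Greedy decoding sketch: repeatedly pick the argmax token until EOS."""
    prefix = []
    for _ in range(max_new_tokens):
        logits = step_logits_fn(prefix)  # vocabulary logits for the next token
        token = int(np.argmax(logits))   # greedy: take the single best token
        prefix.append(token)
        if token == eos_id:
            break
    return prefix

def toy_model(prefix):
    """Stand-in 'decoder': prefers token len(prefix)+1, then EOS (id 0)."""
    logits = np.zeros(5)
    nxt = len(prefix) + 1 if len(prefix) < 3 else 0
    logits[nxt] = 1.0
    return logits

print(greedy_decode(toy_model, max_new_tokens=10, eos_id=0))  # [1, 2, 3, 0]
```

Sampling replaces the `argmax` with a draw from the softmax distribution; beam search keeps several candidate prefixes instead of one.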

T5: Text-to-Text Transfer Transformer

T5 (Raffel et al., 2020) made a key unifying bet: every NLP task can be framed as text-in, text-out. Rather than designing task-specific heads, T5 prepends a natural language task prefix to the input:

| Task | Input to T5 | Expected output |
| --- | --- | --- |
| Translation | translate English to French: The cat sat. | Le chat était assis. |
| Summarization | summarize: [long article...] | [short summary] |
| Sentiment | sentiment: I loved this film. | positive |
| QA | question: Who wrote Hamlet? context: Shakespeare... | Shakespeare |

This framing means T5 is fine-tuned for new tasks simply by changing the prefix format — no architecture modification needed.

Span Corruption (T5's Pre-Training Objective)

BERT masks individual tokens (MLM). T5 masks contiguous spans of tokens and replaces each span with a single sentinel token. The model must predict the original spans:

Input:  The cat ⟨X⟩ on ⟨Y⟩ mat.
Target: ⟨X⟩ sat ⟨Y⟩ the ⟨EOS⟩

Formally, spans are sampled with mean length $\mu = 3$ tokens, covering 15% of input tokens. The target is the concatenation of all original spans (with their sentinels as delimiters).

Advantages over MLM:

  • The decoder learns to produce contiguous text, not just single tokens — better pretraining for generation tasks
  • Shorter targets reduce compute: the decoder predicts only the ~15% of tokens that were masked, not the full input sequence
  • Multiple spans per example create richer training signal
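As a concrete sketch, the example above can be reproduced with T5's actual sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...). Here the span positions are passed in explicitly so the output is deterministic; real pre-training samples them randomly (mean length 3, ~15% coverage):

```python
SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # T5's sentinel vocabulary

def span_corrupt(tokens, spans):
    """Replace each (start, length) span with one sentinel.
    The target is each sentinel followed by the text it replaced."""
    inp, tgt, i = [], [], 0
    for s, (start, length) in enumerate(spans):
        inp.extend(tokens[i:start])              # keep unmasked tokens
        inp.append(SENTINELS[s])                 # one sentinel per span
        tgt.append(SENTINELS[s])
        tgt.extend(tokens[start:start + length]) # span content goes to the target
        i = start + length
    inp.extend(tokens[i:])
    tgt.append("</s>")
    return inp, tgt

tokens = "The cat sat on the mat .".split()
inp, tgt = span_corrupt(tokens, [(2, 1), (4, 1)])  # mask "sat" and "the"
print(" ".join(inp))  # The cat <extra_id_0> on <extra_id_1> mat .
print(" ".join(tgt))  # <extra_id_0> sat <extra_id_1> the </s>
```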
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

# T5 text-to-text: task prefix + input
input_text = (
    "summarize: The Eiffel Tower was built between 1887 and 1889. "
    "It stands 330 meters tall and is the most-visited monument in the world."
)

inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=60)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
# → "The Eiffel Tower was built between 1887 and 1889 and stands 330 meters tall."
```

Walkthrough

Tracing a Summarization Request Through T5

Input (after tokenization):

summarize: Central banks raised interest rates sharply in 2022
to combat inflation. The Federal Reserve increased rates seven
times, bringing the federal funds rate from near zero to over 4%.

Step 1 — Encoder pass. All tokens processed in parallel with full bidirectional self-attention. Output: one 768-dim vector per input token, each informed by the entire input context.

Step 2 — Decoder generates token by token, starting from just <pad> (T5's start token):

  • Token 1: cross-attends to the encoder output → strong attention on "Central banks", "raised", "rates" → generates "Central"
  • Token 2: cross-attends to the same (unchanged) encoder output → generates "banks"
  • Token 3: attention shifts toward "2022", "inflation" → generates "raised"
  • ... continues until <EOS> is sampled

Step 3 — Output: "Central banks raised interest rates seven times in 2022 to combat inflation."

Key insight: the encoder output is computed once and never changes during decoding. Every decoder step reads the same $\mathbf{H}$, but uses different decoder-side queries (because the partially decoded output grows). This is the extra cost of encoder-decoder decoding: the encoder runs once, but every decode step performs cross-attention over the full encoder output in addition to self-attention over the tokens generated so far.

Training vs. inference — teacher forcing. During training, the decoder does not consume its own predicted tokens as inputs. Instead it receives the true previous token at every step: step $t$ is conditioned on $y_{t-1}^*$ (the ground-truth token), not $\hat{y}_{t-1}$ (the model's prediction). This technique is called teacher forcing.

Teacher forcing is necessary to make training tractable. Without it, an error at decoding step 3 feeds into step 4, compounds into step 5, and within a few steps the decoder is in a state that the network has never seen during training — gradients become meaningless. Teacher forcing keeps every step conditioned on the true previous token, ensuring clean gradient estimates throughout the sequence. The downside — exposure bias, where the model never sees its own errors during training — is a real problem at inference time, addressed by techniques like scheduled sampling.
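The input/label shift is easy to see concretely. A toy sketch (T5 really does use <pad> as its decoder start token):

```python
# Teacher forcing: the decoder input is the ground-truth target shifted right.
target = ["Le", "chat", "était", "assis", "."]

decoder_input = ["<pad>"] + target[:-1]  # tokens the decoder consumes
labels = target                          # tokens it must predict

for step, (inp, lab) in enumerate(zip(decoder_input, labels), start=1):
    print(f"step {step}: consume {inp!r} → predict {lab!r}")
```

Because every decoder input is known in advance, all steps of the training loss can be computed in one parallel forward pass; only inference must proceed token by token.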

Analysis & Evaluation

Where Your Intuition Breaks

The encoder compresses the input into a representation that captures everything important. The original 2014 seq2seq architecture used a single fixed-size bottleneck vector — and it worked surprisingly well on short sequences. For long inputs, it doesn't: a single vector cannot hold the information from a 500-word document without significant loss. This is precisely why Bahdanau attention, a year later, replaced the bottleneck with attention over all encoder states. The bottleneck assumption is wrong; the fix is now universal.

Enc-Decoder vs. Decoder-Only for Generation Tasks

The enc-dec architecture was dominant until ~2022. Since then, decoder-only models have largely taken over, even for generation tasks. Why?

| Property | Encoder-Decoder (T5, BART) | Decoder-Only (GPT, Llama, Claude) |
| --- | --- | --- |
| Architecture | Encoder + cross-attention + decoder | Decoder only |
| Input representation | Rich bidirectional (encoder) | Causal (left-to-right only) |
| Pre-training | Span corruption / denoising | Next-token prediction |
| Scaling behavior | Good at moderate size | Scales more efficiently to very large |
| Few-shot ability | Weak without fine-tuning | Strong (in-context learning) |
| Seq2seq tasks | Strong (native) | Competitive at scale |
| Parameters per FLOP | Efficient (encoder shared) | Less efficient at small scale |

Why enc-dec still wins in specific cases:

  • Strict input-output boundary matters (you never want input tokens in the output): translation, structured extraction
  • Context much longer than output: summarization of very long documents
  • Latency budget for encoder is OK: you can run encoder once and cache for many decoding calls

Why decoder-only has taken over:

  • In-context learning (few-shot prompting) works without fine-tuning
  • Simpler KV cache infrastructure — only one model to optimize
  • Better emergent reasoning at scale (>13B params)

T5 Variants

| Model | Key change | When to use |
| --- | --- | --- |
| T5-base / T5-large | Original 2020 model | Fine-tuning baselines |
| T5-v1.1 | No dropout during pre-training; better init | Preferred over original T5 |
| mT5 | Multilingual (101 languages) | Cross-lingual tasks |
| FLAN-T5 | Instruction-tuned on 1,836 tasks | Zero/few-shot via prompting; best default choice |
| FLAN-UL2 | Mixture-of-Denoisers pre-training | Long-context tasks |

For most new projects requiring seq2seq: start with FLAN-T5-large (780M params). It handles prompting without fine-tuning, and fine-tunes well when you have labeled data.

🚀Production

When to reach for T5 / enc-dec:

  • Summarization, translation, data-to-text, structured extraction — tasks where input and output are clearly separated
  • You have labeled (input, output) pairs and budget to fine-tune — enc-dec fine-tuning is highly sample-efficient
  • Latency matters and inputs are long: encode once, decode fast

When to skip it:

  • You need strong zero-shot generalization without fine-tuning — prefer a decoder-only model
  • Tasks require reasoning over very long context (32K+) — enc-dec context windows are smaller
  • You're building a chatbot or instruction-following assistant — decoder-only is the natural fit

Practical tips:

  • Task prefix format matters: "summarize: " and "tldr: " produce different outputs from the same model
  • Max decoder length (max_new_tokens) must be set explicitly — T5 won't stop unless <EOS> is generated or the limit is hit
  • Use num_beams=4 for quality-sensitive tasks (translation, summarization); greedy for speed-sensitive tasks
  • FLAN-T5 checkpoint sizes: small (80M), base (250M), large (780M), xl (3B), xxl (11B)
