Fine-Tuning in Practice
The Module 03 SFT lesson covers how fine-tuning works mechanically. This lesson covers when to use it and how to configure it as an engineering decision. Fine-tuning has real costs — data collection, training runs, evaluation, and ongoing maintenance — and it isn't always the right tool. Prompting and RAG solve many problems faster. Understanding where fine-tuning wins, how to select LoRA hyperparameters, and how to measure whether it worked separates practitioners who use fine-tuning effectively from those who waste weeks on it.
Theory
[Diagram: params = r × 2D × modules × layers, with trainable-parameter bars growing as rank increases against an illustrative quality curve.]
Full fine-tuning changes every weight in the model — billions of parameters — to shift behavior by a small amount. LoRA asks: what if the update itself is low-rank? Instead of storing the full weight update matrix, store two thin matrices whose product approximates it. The diagram above shows how rank controls the expressivity of the update: low rank for format/style changes, higher rank for new reasoning patterns. The insight that makes LoRA practical is empirical: weight updates from fine-tuning are consistently low-rank in practice, so the approximation loses almost nothing.
Parameter Count: LoRA vs Full Fine-Tuning
For a weight matrix of size d × d, LoRA adds 2 × d × r trainable parameters vs d² for full fine-tuning. Across a transformer with L layers and typical projection matrices (d = 4096):
The low-rank factorization is justified by empirical evidence from Aghajanyan et al. (2020), who showed that fine-tuned weight updates occupy a surprisingly low-dimensional subspace: the "intrinsic dimensionality" of adaptation is far smaller than the full parameter count suggests. If adaptation were inherently high-dimensional, LoRA would lose significant quality at small r. The practical finding that small ranks (roughly r = 4–16) capture most of the adaptation quality across diverse tasks is what makes LoRA viable, not the factorization itself.
params = 2 × d × r × m × L, where m is the number of targeted projection matrices (commonly 2 for q_proj + v_proj, up to 6 for all attention + MLP) and L is the number of layers.
At r = 8, d = 4096 (7B-class model), 32 layers, 4 modules: 2 × 4096 × 8 × 4 × 32 ≈ 8.4M trainable parameters vs ~7B total, roughly 0.1%.
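The arithmetic above can be checked directly. A minimal sketch, using the example values from the text (d = 4096, r = 8, 4 targeted modules, 32 layers); the helper name is ours:

```python
def lora_params(d: int, r: int, modules: int, layers: int) -> int:
    """Trainable LoRA parameters: two thin matrices (d x r and r x d)
    per targeted projection matrix, per layer."""
    return 2 * d * r * modules * layers

# 7B-class example: d=4096, r=8, 4 targeted modules, 32 layers
n = lora_params(d=4096, r=8, modules=4, layers=32)
print(n)           # 8388608, i.e. about 8.4M
print(n / 7e9)     # about 0.0012, i.e. roughly 0.1% of 7B
```

Note the count is linear in r: doubling the rank doubles the adapter size, which matters when comparing ranks later.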
The Alpha Scaling Convention
The LoRA update is scaled by α/r, so the effective contribution of the adapter is:

ΔW = (α / r) · BA
Setting α = 2r (a common default) gives α/r = 2, doubling the effective scale of the adapter relative to the α = r baseline. Intuition: a higher rank covers more of the weight space, so you scale down the per-rank contribution to keep the effective magnitude stable.
Practical consequence: when you increase r, increase α proportionally to keep the same effective scale. Doubling r without doubling α halves the per-rank contribution.
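The convention is plain arithmetic and easy to verify. A hypothetical helper:

```python
def effective_scale(alpha: float, r: int) -> float:
    """LoRA multiplies the BA update by alpha / r."""
    return alpha / r

print(effective_scale(16, 8))    # 2.0: alpha = 2r doubles the adapter's scale
print(effective_scale(16, 16))   # 1.0: doubling r without alpha halves the scale
print(effective_scale(32, 16))   # 2.0: restoring alpha = 2r keeps the scale stable
```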
Data Efficiency
The number of examples needed scales roughly with task complexity:
| Task type | Typical examples needed |
|---|---|
| Output format / style | 200–500 |
| Domain vocabulary + terminology | 500–2K |
| New reasoning patterns | 2K–10K |
| New factual knowledge | 10K+ (unreliable — use RAG instead) |
Fine-tuning teaches the model how to respond, not what facts to know. For knowledge-intensive tasks, RAG typically outperforms fine-tuning regardless of dataset size, because retrieval is grounded at inference time rather than baked into weights at training time.
Walkthrough
Configuring LoRA for a Domain Classification Task
Task: classify customer support tickets into 20 categories (billing, cancellation, technical, etc.) with a structured JSON output.
Starting point: FLAN-T5-large (780M params) or Llama-3.2-3B.
Step 1 — Choose rank:
from peft import LoraConfig
# For format/style tasks: r=4 or r=8 is sufficient
# For this classification task, r=8 with all attention layers
lora_cfg = LoraConfig(
r=8,
lora_alpha=16, # alpha = 2r
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)

Step 2 — Validate data format and quality:
# Check class balance — severe imbalance harms fine-tuning
from collections import Counter
label_counts = Counter(ex["label"] for ex in train_dataset)
# Aim for no class with fewer than ~50 examples
# Oversample rare classes or collect more data before training
# Verify your output format is exactly what you want
sample = train_dataset[0]
print(f"Input: {sample['prompt'][:200]}")
print(f"Output: {sample['completion']}")
# Should show: {"category": "billing", "confidence": "high"}

Step 3 — Train with early stopping:
from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
model=model,
train_dataset=train_dataset,
eval_dataset=val_dataset,
peft_config=lora_cfg,
args=SFTConfig(
num_train_epochs=3,
learning_rate=2e-4,
per_device_train_batch_size=8,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
),
)
trainer.train()

Step 4 — Evaluate on held-out test set:
import json

# Don't just check loss — measure task-level accuracy.
# Note: model.generate takes token ids, not raw strings; decode only the new tokens.
def generate_text(prompt):
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=64)
    return tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

predictions = [generate_text(ex["prompt"]) for ex in test_dataset]
accuracy = sum(p.strip() == t for p, t in zip(predictions, test_labels)) / len(test_labels)

# Also check: does the model produce valid JSON?
def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

valid_json_rate = sum(is_valid_json(p) for p in predictions) / len(predictions)

Analysis & Evaluation
Where Your Intuition Breaks
Higher rank means better fine-tuning results. Rank controls the expressivity budget of the adapter, not its quality. Increasing rank beyond the intrinsic dimensionality of the task adds trainable parameters that overfit to noise rather than capturing signal. Empirically, diminishing returns set in around r = 8–16 for most tasks, and r = 64 rarely outperforms r = 16 while adding significantly to training time and memory. Higher rank is only justified when the task genuinely requires modeling complex, high-dimensional transformations: new reasoning patterns on large diverse datasets, not format adaptation.
Fine-Tuning vs Prompting vs RAG
| Dimension | Fine-Tuning | Prompting | RAG |
|---|---|---|---|
| Upfront cost | High (data + training) | None | Medium (pipeline + index) |
| Per-inference cost | Smaller model → cheaper | Full model required | Retrieval overhead |
| Latency | Lower (smaller model) | Higher | Higher (retrieval round-trip) |
| Output format | Very reliable | Fragile with long schemas | Varies |
| New factual knowledge | Unreliable (memorization) | Only in-context | Strong (grounded at query time) |
| Behavior change | Strong | Moderate | No behavior change |
| Best for | Consistent style/format, domain vocabulary | Prototyping, general reasoning | Knowledge-intensive Q&A, dynamic data |
Decision heuristic: if you're primarily fixing how the model responds (format, tone, domain-specific phrasing), fine-tune. If you're trying to give the model access to information it doesn't have, use RAG. If the task is new and you're not sure, start with prompting to validate the task works at all.
Common Fine-Tuning Failures
Training on format, not behavior: the model learns to mimic the training set format but doesn't generalize. Symptom: high accuracy on validation (which comes from same distribution as training) but poor performance on real queries. Fix: held-out evaluation on real user queries, not a random split of your training data.
Incorrect loss masking: loss computed on prompt tokens. Symptom: very low training loss (the model "cheats" by predicting the known prompt). Fix: verify the labels tensor has -100 for all prompt tokens.
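A quick sanity check for the masking, sketched without framework dependencies (`prompt_is_masked` is our helper, and `prompt_len` stands in for whatever prompt/completion boundary your collator computes):

```python
IGNORE_INDEX = -100  # Hugging Face loss functions skip positions labeled -100

def prompt_is_masked(labels, prompt_len):
    """True if every prompt position is excluded from the loss."""
    return all(tok == IGNORE_INDEX for tok in labels[:prompt_len])

# Toy sequence: 4 prompt tokens masked, 3 completion tokens kept in the loss
labels = [-100, -100, -100, -100, 101, 102, 103]
assert prompt_is_masked(labels, prompt_len=4)
assert not prompt_is_masked([1, -100, 5], prompt_len=2)  # prompt leaking into loss
```

Run this check on a real batch from your collator before training; works on lists or on a tensor converted with `.tolist()`.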
Learning rate too high: loss spikes early. Fix: reduce LR by 2–5×. With LoRA, 2e-4 is usually safe; 5e-4 often causes instability.
Before you fine-tune, ask these questions:
- Does the base model + good prompting already get 70%+ of the way there? If yes, prompting is probably sufficient.
- Do you have at least 500 high-quality examples in the exact format you want? If not, collect data first.
- Will you be able to maintain and re-train this model when the base model is updated? Fine-tuning creates a maintenance burden.
If the answer to all three is yes, fine-tune. Otherwise, exhaust prompting and RAG first.
LoRA defaults that work for 90% of cases: r=8, alpha=16, target_modules=["q_proj","v_proj"], lr=2e-4, epochs=2.