
Fine-Tuning in Practice

The Module 03 SFT lesson covers how fine-tuning works mechanically. This lesson covers when to use it and how to configure it as an engineering decision. Fine-tuning has real costs — data collection, training runs, evaluation, and ongoing maintenance — and it isn't always the right tool. Prompting and RAG solve many problems faster. Understanding where fine-tuning wins, how to select LoRA hyperparameters, and how to measure whether it worked separates practitioners who use fine-tuning effectively from those who waste weeks on it.

Theory

[Interactive diagram — LoRA rank: trainable parameters vs quality. At the default r = 8 (~4.2M trainable parameters, ~0.06% of the base model, estimated quality ~81%): the sweet spot covering most instruction-following and domain tasks. Bars show params = r × 2D × modules × layers; the quality curve is illustrative.]

Full fine-tuning changes every weight in the model — billions of parameters — to shift behavior by a small amount. LoRA asks: what if the update itself is low-rank? Instead of storing the full weight update matrix, store two thin matrices whose product approximates it. The diagram above shows how rank $r$ controls the expressivity of the update: low rank for format/style changes, higher rank for new reasoning patterns. The insight that makes LoRA practical is empirical: weight updates from fine-tuning are consistently low-rank in practice, so the approximation loses almost nothing.
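The two-thin-matrices idea can be made concrete with a toy example (the shapes below are illustrative, much smaller than a real model layer):

```python
import numpy as np

d, k, r = 512, 512, 8              # toy sizes (a real projection might be 4096 x 4096)
rng = np.random.default_rng(0)
A = rng.standard_normal((d, r))    # thin "down" factor
B = rng.standard_normal((r, k))    # thin "up" factor

delta_W = A @ B                    # full-size update, but rank <= r
full_params = d * k                # 262,144 if the update were stored densely
lora_params = r * (d + k)          # 8,192 stored as two thin matrices

print(lora_params / full_params)   # 0.03125 — ~3% of dense storage at this toy size
```

The quality of the approximation depends entirely on whether the true update is close to low-rank — which, per the empirical finding above, it usually is.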

Parameter Count: LoRA vs Full Fine-Tuning

For a weight matrix $W \in \mathbb{R}^{d \times k}$, LoRA adds $r(d + k)$ trainable parameters vs $dk$ for full fine-tuning. Across a transformer with $L$ layers and typical projection matrices ($d = k = D$):

$$\text{LoRA params} = r \cdot 2D \cdot N_{\text{modules}} \cdot L$$

where $N_{\text{modules}}$ is the number of targeted projection matrices (commonly 2 for q_proj + v_proj, up to 6 for all attention + MLP).

The low-rank factorization is justified by empirical evidence from Aghajanyan et al. (2020), who showed that fine-tuned weight updates occupy a surprisingly low-dimensional subspace — the "intrinsic dimensionality" of adaptation is far smaller than the full parameter count suggests. If adaptation were inherently high-dimensional, LoRA would lose significant quality at small $r$. The practical finding — that $r \leq 16$ captures most of the adaptation quality across diverse tasks — is what makes LoRA viable, not the factorization itself.

At $r = 8$, $D = 4096$ (7B-class model), 32 layers, 4 modules: $8 \times 2 \times 4096 \times 4 \times 32 \approx 8.4\text{M}$ trainable parameters vs ~7B total — roughly 0.1%.
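The arithmetic can be checked in a couple of lines:

```python
# Plugging the example numbers into the parameter-count formula
def lora_params(r, d_model, n_modules, n_layers):
    # each targeted matrix gains A (D x r) and B (r x D): 2 * r * D params
    return r * 2 * d_model * n_modules * n_layers

total = lora_params(8, 4096, 4, 32)
print(total)            # 8388608 ≈ 8.4M
print(total / 7e9)      # ≈ 0.0012 → ~0.1% of a 7B model
```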

The Alpha Scaling Convention

The LoRA update is scaled by $\alpha / r$, so the adapter's effective contribution to the weights is:

$$\Delta W = \frac{\alpha}{r} A B^\top$$

Setting $\alpha = 2r$ (a common default) gives $\alpha/r = 2$, doubling the effective scale of the adapter relative to the $\alpha = r$ baseline. Intuition: a higher rank covers more of the weight space, so you scale down the per-rank contribution to keep the effective magnitude stable.

Practical consequence: when you increase $r$, increase $\alpha$ proportionally to keep the same effective scale. Doubling $r$ without doubling $\alpha$ halves the per-rank contribution.
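The convention reduces to two assertions:

```python
# The alpha/r scaling convention, checked numerically
def effective_scale(alpha, r):
    return alpha / r

# alpha = 2r keeps the adapter's effective scale at 2 regardless of rank
for r in (4, 8, 16, 32):
    assert effective_scale(2 * r, r) == 2.0

# doubling r while holding alpha fixed halves the per-rank contribution
assert effective_scale(16, 16) == effective_scale(16, 8) / 2
print("alpha/r convention holds")
```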

Data Efficiency

The number of examples needed scales roughly with task complexity:

Task type | Typical examples needed
Output format / style | 200–500
Domain vocabulary + terminology | 500–2K
New reasoning patterns | 2K–10K
New factual knowledge | 10K+ (unreliable — use RAG instead)

Fine-tuning teaches the model how to respond, not what facts to know. For knowledge-intensive tasks, RAG generally outperforms fine-tuning regardless of dataset size, because retrieval is grounded at inference time rather than baked into weights at training time.
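One way to use the table when planning a dataset — a hypothetical helper with the ranges above hard-coded (the task-type labels are illustrative, not from any library):

```python
# Rough dataset-size targets from the table above (hypothetical helper)
EXAMPLES_NEEDED = {
    "format_style": (200, 500),
    "domain_vocabulary": (500, 2_000),
    "reasoning_patterns": (2_000, 10_000),
}

def data_budget_ok(task_type: str, n_examples: int) -> bool:
    # True once you clear the low end of the range for the task type
    low, _high = EXAMPLES_NEEDED[task_type]
    return n_examples >= low

print(data_budget_ok("format_style", 350))        # True
print(data_budget_ok("reasoning_patterns", 800))  # False
```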

Walkthrough

Configuring LoRA for a Domain Classification Task

Task: classify customer support tickets into 20 categories (billing, cancellation, technical, etc.) with a structured JSON output.

Starting point: FLAN-T5-large (780M params) or Llama-3.2-3B.

Step 1 — Choose rank:

python
from peft import LoraConfig
 
# For format/style tasks: r=4 or r=8 is sufficient
# For this classification task, r=8 across all attention projections
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,          # alpha = 2r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # Llama-style names
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",  # for FLAN-T5 use SEQ_2_SEQ_LM and T5 module names ("q","k","v","o")
)

Step 2 — Validate data format and quality:

python
# Check class balance — severe imbalance harms fine-tuning
from collections import Counter
label_counts = Counter(ex["label"] for ex in train_dataset)
# Aim for no class with fewer than ~50 examples
# Oversample rare classes or collect more data before training
 
# Verify your output format is exactly what you want
sample = train_dataset[0]
print(f"Input: {sample['prompt'][:200]}")
print(f"Output: {sample['completion']}")
# Should show: {"category": "billing", "confidence": "high"}

Step 3 — Train with early stopping:

python
from trl import SFTTrainer, SFTConfig
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    peft_config=lora_cfg,
    args=SFTConfig(
        num_train_epochs=3,
        learning_rate=2e-4,
        per_device_train_batch_size=8,
        eval_strategy="epoch",       # evaluation must run for best-model selection
        save_strategy="epoch",       # must match eval_strategy for load_best_model_at_end
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
    ),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],  # stop when eval loss stops improving
)
trainer.train()

Step 4 — Evaluate on held-out test set:

python
# Don't just check loss — measure task-level accuracy
predictions = []
for ex in test_dataset:
    inputs = tokenizer(ex["prompt"], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    # decode only the newly generated tokens, not the echoed prompt
    predictions.append(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                        skip_special_tokens=True))
accuracy = sum(p.strip() == ex["completion"].strip()
               for p, ex in zip(predictions, test_dataset)) / len(test_dataset)

# Also check: does the model produce valid JSON?
valid_json_rate = sum(is_valid_json(p) for p in predictions) / len(predictions)
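The is_valid_json helper used above isn't shown in the snippet; a minimal version:

```python
import json

def is_valid_json(s: str) -> bool:
    # A prediction counts only if the entire string parses as JSON
    try:
        json.loads(s)
        return True
    except (ValueError, TypeError):
        return False

print(is_valid_json('{"category": "billing", "confidence": "high"}'))  # True
print(is_valid_json('category: billing'))                              # False
```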

Analysis & Evaluation

Where Your Intuition Breaks

Higher rank means better fine-tuning results. Rank controls the expressivity budget of the adapter, not its quality. Increasing rank beyond the intrinsic dimensionality of the task adds trainable parameters that overfit to noise rather than capturing signal. Empirically, diminishing returns set in around $r = 16$ for most tasks, and $r > 64$ rarely outperforms $r = 16$ while adding significantly to training time and memory. Higher rank is only justified when the task genuinely requires modeling complex, high-dimensional transformations — new reasoning patterns on large diverse datasets, not format adaptation.

Fine-Tuning vs Prompting vs RAG

Dimension | Fine-Tuning | Prompting | RAG
Upfront cost | High (data + training) | None | Medium (pipeline + index)
Per-inference cost | Smaller model → cheaper | Full model required | Retrieval overhead
Latency | Lower (smaller model) | Higher | Higher (retrieval round-trip)
Output format | Very reliable | Fragile with long schemas | Varies
New factual knowledge | Unreliable (memorization) | Only in-context | Strong (grounded at query time)
Behavior change | Strong | Moderate | No behavior change
Best for | Consistent style/format, domain vocabulary | Prototyping, general reasoning | Knowledge-intensive Q&A, dynamic data

Decision heuristic: if you're primarily fixing how the model responds (format, tone, domain-specific phrasing), fine-tune. If you're trying to give the model access to information it doesn't have, use RAG. If the task is new and you're not sure, start with prompting to validate the task works at all.
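The heuristic can be written down as a triage function (the return labels and argument names are illustrative):

```python
def choose_approach(needs_new_knowledge: bool, needs_behavior_change: bool,
                    task_validated: bool) -> str:
    # Encodes the decision heuristic above; labels are illustrative
    if not task_validated:
        return "prompting"      # validate the task works at all first
    if needs_new_knowledge:
        return "rag"            # retrieval is grounded; weights memorize poorly
    if needs_behavior_change:
        return "fine-tuning"    # format, tone, domain phrasing
    return "prompting"

print(choose_approach(needs_new_knowledge=False, needs_behavior_change=True,
                      task_validated=True))   # fine-tuning
```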

Common Fine-Tuning Failures

Training on format, not behavior: the model learns to mimic the training set format but doesn't generalize. Symptom: high accuracy on validation (which comes from same distribution as training) but poor performance on real queries. Fix: held-out evaluation on real user queries, not a random split of your training data.

Incorrect loss masking: loss computed on prompt tokens. Symptom: very low training loss (the model "cheats" by predicting the known prompt). Fix: verify labels tensor has -100 for all prompt tokens.
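The masking fix can be sanity-checked on a toy batch (the token ids below are made up):

```python
# Build labels that put -100 on prompt positions so loss covers only the completion
prompt_ids = [101, 2023, 2003, 1037]   # made-up token ids for the prompt
completion_ids = [4248, 102]           # made-up token ids for the completion

input_ids = prompt_ids + completion_ids
labels = [-100] * len(prompt_ids) + completion_ids

# every prompt position masked, every completion position kept
assert all(l == -100 for l in labels[:len(prompt_ids)])
assert labels[len(prompt_ids):] == completion_ids
print(labels)   # [-100, -100, -100, -100, 4248, 102]
```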

Learning rate too high: loss spikes early. Fix: reduce LR by 2–5×. With LoRA, 2e-4 is usually safe; 5e-4 often causes instability.

🚀 Production

Before you fine-tune, ask these questions:

  1. Does the base model + good prompting already get 70%+ of the way there? If yes, prompting is probably sufficient.
  2. Do you have at least 500 high-quality examples in the exact format you want? If not, collect data first.
  3. Will you be able to maintain and re-train this model when the base model is updated? Fine-tuning creates a maintenance burden.

If prompting alone falls short and the answers to (2) and (3) are yes, fine-tune. Otherwise, exhaust prompting and RAG first.

LoRA defaults that work for 90% of cases: r=8, alpha=16, target_modules=["q_proj","v_proj"], lr=2e-4, epochs=2.
