Evaluating fine-tunes
In one line: A fine-tune is a hypothesis ("this model is better at our task") until a held-out eval proves it — and you must check not just that it got better at the target task, but that it didn't quietly get worse at everything else.
Imagine sending that employee on a course and just assuming it helped because the course was expensive. Ridiculous — you'd test them afterward, and you'd also check they didn't forget how to do their other duties while specializing. Evaluating a fine-tune is exactly that: a graded exam on the new skill, plus a regression check that the old skills survived. Skipping it is how teams ship a model that's better at one thing and broken at three others — and never find out until users do.
This page is the fine-tuning-specific application of evaluation. The full discipline — building golden sets, metrics, LLM-as-judge, CI gating — lives in the Evaluation chapter, which you should treat as required reading alongside this page. Here we focus on the three questions unique to fine-tunes.
Question 1: Did it get better at the target task?
You answer this with a held-out eval set — examples the model never saw in training (not the train set, not even the validation set you watched during training). It must be a fresh, representative sample of the real task.
The non-negotiable move: always compare against the base model on the same eval set. "The fine-tune scores 82%" is meaningless. "The fine-tune scores 82% vs the base model's 61% on the same 200 held-out cases" is a result.
# Pseudocode for the core comparison — the shape of every fine-tune eval.
def score(model, eval_set, grade) -> float:
correct = 0
for case in eval_set:
out = model.generate(case["input"])
correct += grade(out, case) # exact-match, schema-valid, judge, etc.
return correct / len(eval_set)
base_score = score(base_model, held_out, grade)
ft_score = score(finetuned_model, held_out, grade)
print(f"base={base_score:.1%} fine-tuned={ft_score:.1%} "
f"delta={ft_score - base_score:+.1%}")
# Ship only if the delta is real (and ideally statistically meaningful, not n=10).
How you grade depends on the task — exact match for classification, schema validity for structured output, similarity or an LLM-as-judge for open-ended text. Pick the grader before you run, so you can't rationalize the result afterward.
Question 2: Did it get worse at anything? (Regression & catastrophic forgetting)
This is the check beginners forget, and it's the one that bites. Fine-tuning hard on a narrow task can degrade abilities outside that task — known as catastrophic forgetting. The model becomes a brilliant support agent that can no longer count, follow general instructions, or write in another language it used to handle.
Defend against it with a regression / capability suite run alongside your task eval:
- A general-capability mini-suite. A small fixed set of off-task prompts (basic reasoning, instruction following, formatting, multilingual if relevant). Score base vs fine-tuned; a big drop is a red flag.
- Safety/refusal checks. Fine-tuning can erode safety behaviour. Re-run your safety prompts on the fine-tune. (See safety patterns.)
- The "format obsession" check. A model fine-tuned to always emit JSON may now emit JSON even when you ask for prose. Test a few off-distribution prompts.
Base model Fine-tuned Verdict
Target task 61% 82% +21 ✓ the win
General reasoning 74% 73% -1 ✓ acceptable
Instruction-follow 88% 71% -17 ✗ catastrophic forgetting!
Safety refusals 99% 97% -2 ⚠ watch
If you see forgetting: lower the learning rate, train fewer epochs, prefer LoRA over full fine-tuning (frozen base forgets less), or mix a small fraction of general/replay data into your training set so the model keeps practicing old skills.
Question 3: Does it win with real users? (A/B in production)
Offline evals approximate reality; production tells the truth. Before a fine-tune fully replaces the incumbent, A/B test it:
- Route a small slice (start at 5–10%) of traffic to the fine-tune, keyed on a stable id (user/tenant) so a given user gets a consistent experience.
- Compare real outcome metrics, not vibes: resolution rate, thumbs-up rate, escalation rate, regenerations, latency, cost-per-request.
- Keep the rollback button warm. If the treatment underperforms, you can shift traffic back instantly — which is exactly what the serving page makes cheap with adapter hot-swapping and versioned endpoints.
Production also feeds your next fine-tune: the cases where the model failed become new training/eval data. That data flywheel — sample production, grade it, fold the worst into your sets — is covered in the Evaluation chapter, and it's how fine-tunes stay good as the task drifts.
A pragmatic eval workflow
1. Freeze a held-out eval set BEFORE training. Never train on it.
2. Define graders up front (exact-match / schema / judge).
3. After training: score BASE vs FINE-TUNED on the held-out set.
4. Run the regression suite (general + safety) — catch forgetting.
5. If both pass: A/B a small traffic slice in production.
6. Watch outcome metrics; ramp up or roll back.
7. Sample production failures -> next dataset. Repeat.
Common pitfalls
- No held-out set / evaluating on training data. Memorized examples score ~100% and mean nothing. Hold data out and never touch it until the verdict.
- No base-model baseline. A score with nothing to compare to is noise. Always run base vs fine-tuned on the same set.
- Forgetting the regression check. The fine-tune got better at the target and worse at three other things — and you shipped it because you only measured the target.
- Trusting training/validation loss as the verdict. Low loss ≠ good task performance. Run the actual task eval.
- Tiny eval sets. n=10 deltas are noise. Use enough cases (and ideally significance/confidence) to trust the difference.
- Skipping the production A/B. Offline wins don't always hold up live. Ramp behind an A/B with a rollback ready.
→ Next: Serving fine-tuned models