Part 7: Fine-tuning & Customization

The point where you stop steering a model with words and start changing what it has learned.

In one line: Fine-tuning takes an existing model and keeps training it on your examples so the new behaviour is baked into the weights — and most of the skill is in the dataset, the decision to do it at all, and proving it actually worked.

In plain English

Prompting is giving a smart new hire a sticky note of instructions for each task. RAG is handing them a binder to look things up in. Fine-tuning is sending them on a training course so the skill becomes second nature — they no longer need the sticky note. That course is expensive and slow compared to a sticky note, so you only send people on it when the note keeps getting ignored, is too long, or the same lesson is needed thousands of times a day. This chapter teaches you when to send the model to training, how to write its coursework (the dataset), how the training actually works under the hood, and how to check it learned the right thing without forgetting everything else.

Note: New to the math? Skim the Math primer first. Fine-tuning leans on a little intuition for loss, gradients, and learning rate, and one page there makes the rest of this chapter easier.

What this chapter covers

This is the self-contained, first-principles treatment of fine-tuning. By the end you can decide whether to fine-tune, build a clean dataset, run supervised fine-tuning with LoRA/QLoRA on a budget, understand preference tuning and distillation, evaluate the result against the base model, and serve it in production.

When to fine-tune (and when not to) — the honest decision tree: why prompting and RAG usually win first, the three things fine-tuning is genuinely good at, and the cost/maintenance reality nobody warns you about.
Data preparation: the dataset IS the product — chat/JSONL formats, why quality beats quantity, how many examples you actually need, sourcing, cleaning, the train/validation split, and synthetic data.
Supervised fine-tuning (SFT) from scratch — loss, epochs, learning rate, full fine-tuning vs parameter-efficient, and how to spot overfitting before it ships.
LoRA & QLoRA — why tiny low-rank adapters work, rank and alpha, 4-bit quantization, the memory math that lets you train a big model on one GPU, and when to reach for each.
Preference tuning: RLHF & DPO — aligning a model to preferred behaviour: the reward-model-plus-PPO pipeline, the simpler DPO family that mostly replaced it, and when you'd actually need either.
Distillation — using a big frontier "teacher" to generate training data for a small cheap "student," so a model you can afford to run punches far above its size.
Evaluating fine-tunes — held-out evals, regression against the base model, catastrophic forgetting, and A/B testing in production. The "did it actually work?" page.
Serving fine-tuned models — hosted FT endpoints vs self-hosting, multi-adapter LoRA hot-swapping, versioning, and rollback.
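
As a taste of the "memory math" the LoRA & QLoRA page refers to, here is a minimal back-of-the-envelope sketch. It assumes the standard LoRA setup, in which a frozen weight matrix gets a trainable low-rank update made of two small matrices. The layer width, rank, and 7-billion-parameter model size are hypothetical numbers chosen only for illustration. The memory figure counts weight storage alone and ignores activations, gradients, optimizer state, and quantization overhead, so treat it as a rough lower bound rather than a real budget.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    The frozen matrix W is left untouched; LoRA learns an update B @ A,
    where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


def weight_memory_gib(n_params: int, bits_per_param: int) -> float:
    """GiB needed just to store n_params weights at the given precision."""
    return n_params * bits_per_param / 8 / 2**30


# Hypothetical numbers, for illustration only.
d = 4096                      # width of one square projection matrix
rank = 8                      # LoRA rank r
full = d * d                  # parameters in the frozen matrix
adapter = lora_param_count(d, d, rank)

print(f"full matrix params:  {full:,}")
print(f"LoRA adapter params: {adapter:,}")
print(f"adapter / full:      {adapter / full:.2%}")

n_params = 7_000_000_000      # assumed model size
print(f"16-bit weights: {weight_memory_gib(n_params, 16):.1f} GiB")
print(f"4-bit weights:  {weight_memory_gib(n_params, 4):.1f} GiB")
```

The adapter is a small fraction of the matrix it modifies, which is why only a sliver of the model needs gradients and optimizer state. Storing the frozen weights in 4 bits instead of 16 cuts their footprint to a quarter, which is the part QLoRA adds.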

How to read this chapter

Read it in order the first time, because each page builds on the last. Page numbers here count this overview as page 1. Page 2 (When to fine-tune) is the most important page in the chapter: most people who "need fine-tuning" actually need a better prompt or RAG, and you should be able to tell the difference. Pages 3–5 are the hands-on core (data, then the training itself, then the cheap way to do it). Pages 6–7 (preference tuning and distillation) are powerful but more specialized, so skim them now and return when you need them. Pages 8–9 (evaluation and serving) are non-negotiable for anything you ship: an unevaluated, un-versioned fine-tune is a liability.

This chapter goes deeper on topics you may have met earlier: the prompt-vs-RAG-vs-fine-tune decision, the fine-tuning walkthrough, and the fine-tuning platforms page. We re-teach every concept from scratch here, but those pages are good companions. For proving a fine-tune worked, lean on the whole Evaluation chapter.

→ Start with When to fine-tune (and when not to).
