Iterate with evals
In one line: Look at actual outputs, not at the prompt. The prompt is what you change; the outputs are what you measure.
The iterate phase is where most of the actual quality work happens. You read failing outputs, form a hypothesis about why they failed, change one thing, re-run the eval suite, and check whether the score went up without anything else going down. The trap everyone falls into is changing five things at once, declaring victory on the aggregate score, and silently regressing on a category nobody noticed. Disciplined iteration is what separates teams that go from 0.6 → 0.9 in a quarter from teams that plateau at 0.65.
The loop
- Inspect failures. Sort eval cases by score; read the bottom 10. Don't aggregate — read individual outputs. You're looking for patterns: "the model keeps inventing doc IDs," "the model hedges when the user is angry," "retrieval brings back irrelevant chunks for short queries."
- Hypothesize. "The model gets X wrong because it interprets Y as Z."
- Change one thing. A prompt instruction, a retrieval setting, the model, a few-shot example, the schema.
- Re-run the full eval suite. Not just the failing cases — make sure you didn't regress others.
- Commit and ship if scores improved on aggregate and didn't drop > 5% on any slice. Otherwise revert.
What to vary, in order of cheapness
prompt wording ← try first (free, instant)
↓
few-shot examples ← cheap, often dramatic
↓
system prompt structure ← free
↓
output schema ← cheap, often unlocks quality
↓
retrieval settings ← K, chunk size, reranker, hybrid
↓
model ← cost/latency trade-off
↓
decomposition ← split one call into two specialized calls
↓
fine-tuning ← expensive, rarely the right next step
The discipline is don't skip ahead. Teams reach for fine-tuning when a single sentence added to the system prompt would have fixed it.
What "change one thing" actually means
It's tempting to bundle "improved the retrieval, also tightened the prompt, also moved to a bigger model" into one PR. Don't.
# Bad: three changes, one commit
- top_k = 5
+ top_k = 10
- model = "claude-sonnet-4-5"
+ model = "claude-sonnet-4-6"
- prompt = OLD_PROMPT
+ prompt = NEW_PROMPT_WITH_FEW_SHOT
If the score goes up, you don't know which change helped. If it goes down, you don't know which change hurt. Three separate eval runs cost ~$6 and take 15 minutes. Just do it.
Reading failures well
Sit with the lowest-scoring 10 outputs. For each, answer:
- What did the model output?
- What was expected?
- Why did the model do what it did? (Imagine the model's POV: what was in the context, what was the instruction, what was ambiguous?)
- Is this a prompt problem, a retrieval problem, a model-capability problem, or a case-was-wrong problem?
The last category matters: ~10-20% of eval failures are actually bad eval cases (the expected output was wrong, or under-specified). Fix the case; don't blame the model.
Patterns of common fixes
| Failure pattern | Likely fix |
|---|---|
| Model hallucinates source IDs | Constrain output to only IDs from the retrieved set; validate before returning |
| Model hedges too much | Add a few-shot example showing confident answers |
| Model refuses safe requests | Soften system-prompt safety language; add positive examples |
| Retrieved docs are off-topic | Increase K, add a reranker, try hybrid (BM25 + dense), check chunking |
| Model loses earlier context in long convos | Summarize earlier turns; cap conversation length |
| Output schema fails to parse | Use the SDK's structured output instead of free-form JSON |
| Slow on long inputs | Truncate retrieved docs; cache stable prefixes; downgrade model |
| Inconsistent tone | Few-shot examples with target tone; add a tone field in output schema |
Tracking iteration
Keep a markdown log (per project, in the repo) that looks like:
## Iteration log
### 2026-05-12 — added "cite by doc_id, not by content" instruction
- Hypothesis: model was paraphrasing instead of citing
- Score: 0.61 → 0.68 (+0.07)
- Slice movement: billing +0.10, integrations +0.05, account +0.04
- Cost change: none
- Verdict: shipped
### 2026-05-13 — bumped K from 5 to 10
- Hypothesis: integrations failures looked like retrieval misses
- Score: 0.68 → 0.66 (-0.02)
- Slice movement: integrations -0.04 (more noise), billing flat
- Cost change: +18% per call
- Verdict: reverted
### 2026-05-14 — added reranker (Cohere rerank-3.5) on top-20 → top-5
- Score: 0.68 → 0.74 (+0.06)
- Slice movement: integrations +0.12, others flat
- Cost change: +$0.001 per call
- Latency change: +180ms
- Verdict: shipped
Two effects: future you remembers what you tried, and the team sees the score trajectory.
What to avoid
- Changing more than one thing per cycle. You won't know what helped.
- Aggregate-only inspection. "Score went up by 3%" without looking at examples hides regressions on important slices.
- Optimizing for evals at the expense of real users. Periodically sample real prod data; add failures back to the eval set.
- Tuning until the eval set goes to 1.0. Eval scores asymptote. The last 10% usually means the eval set is too easy.
- Iterating without a CI eval run. Every prompt change should trigger a fresh score.
Real numbers
| Item | Typical |
|---|---|
| Eval score gain in first month of iteration | 0.6 → 0.8 is common |
| Eval score gain in second month | 0.8 → 0.85 is common (diminishing returns) |
| Time per iteration cycle | 30-60 minutes including eval re-run |
| Iterations to plateau | 15-40 |
| Iteration cost (200-case eval × 30 iterations) | ~$60 |
At Acme, iteration cycle 1 (added a cite by doc_id instruction) moved the score from 0.61 to 0.68 in 20 minutes of work. Cycle 4 (added Cohere reranker) moved it from 0.68 to 0.74 and cost an extra $0.001/call + 180ms. By cycle 25 they hit 0.86, the team's pre-agreed "ship to internal cohort" bar. Total iteration time across two weeks: ~22 engineer-hours. Total iteration spend: ~$45 in eval runs.
The Acme engineer keeps docs/iteration-log.md in the repo and updates it after every eval run. Three months in, the log has 47 entries. About 60% of them were reverts. The team's quote-of-the-quarter: "We learn as much from the things we tried that didn't help as from the things that did."
The single biggest jump came from a non-obvious change: replacing the prompt's "Be helpful" instruction with three concrete examples of the desired tone. That moved tone-related grader scores from 0.51 to 0.79 in one commit.
Common anti-patterns
- Vibe iteration. "I added a sentence and it felt better." Without a re-run, you don't know.
- Refusing to revert. When the score drops, revert. Sunk cost is real.
- Slice-blindness. Aggregate up, but the user-facing slice (free-tier users, say) dropped. Ship and regret.
- Eval-set overfit. Tuning to the test, not to the world. Mitigate by adding fresh prod cases continuously.
- Letting the model choose. "Let's try a bigger model." Costs go up, score doesn't. Try the cheap fixes first.
- Iterating on the wrong layer. Spending three days on prompts when the retriever was the bottleneck.
- No iteration log. Every change should leave a record of what was tried and what happened.
- Reading aggregate scores in Slack updates. Always link to the per-slice breakdown.
- Changing prompts in production directly. Prompts are code. They go through a PR, with an eval diff attached.
- "This is good enough — let's stop iterating." Define "good enough" as a numeric eval threshold before you start. Otherwise the bar moves with mood.
- Iterating on the cold eval set forever without sampling prod. Eval cases age out of relevance. Weekly prod-sampling keeps them current.
- Treating iteration as a one-time pre-launch sprint. Iteration is the job, for the life of the product. Budget for it forever.
Checklist before moving on
- Eval score has moved meaningfully from baseline (typical: > +0.15).
- Per-slice scores are within ~5% of each other; no slice is silently regressing.
- An iteration log exists with at least 10 entries.
- Every prompt/retriever change goes through a PR with an eval diff.
- You've sampled at least one batch of real prod outputs and added failures to the eval set.
- You can articulate the next hypothesis you'd test if you had another week.
→ Next: Pre-production hardening