Trajectory & process evals

In one line: For agents, the path matters as much as the destination — trajectory evals score whether the right tools were called in the right order, not only whether the final string looks good.

In plain English

Imagine grading a math student. You can score only the final number (outcome eval) or also whether they showed valid steps (process eval). Agents are the same: a lucky final answer can hide a broken tool chain, wasted searches, or a security violation on step two. Trajectory evals catch what final-answer grading misses.

Outcome vs. process

Eval type	Question	Catches
Outcome	Is the final answer correct / helpful / faithful?	Wrong conclusions, hallucinations
Process (trajectory)	Were the steps reasonable, safe, and efficient?	Wrong tool choice, skipped retrieval, runaway loops, policy violations

You need both. Outcome-only evals let agents gamble — delete the wrong file, recover accidentally, and still pass. Process evals align with how LLM-as-judge and human eval already work — but the unit of grading becomes a trace, not one blob of text.

What to score on a trajectory

Tool correctness — For task X, did the agent call the expected tool family (search before answer, run_tests after edit)? Gold trajectories can be partially specified — required steps without micromanaging every argument.

Efficiency — Steps to success, tokens spent, wall-clock time. A correct answer in forty tool calls may fail a production SLO even if outcome eval passes.

Safety — No forbidden tools, no PII in logs, no exfiltration patterns. Tie to OWASP LLM Top 10.

Faithfulness per step — Did intermediate claims stay supported by retrieved sources before the model synthesized?

Recovery — After a tool error, did the agent retry sensibly or spiral?

How to implement without drowning in work

1. Start from traces you already log. If the harness exports spans, eval cases are (input, trace, expected_outcome, optional_step_constraints).

2. LLM judge on the trace. Feed the judge a condensed trace (tool name, args summary, result summary — not full payloads) plus a rubric:

TRAJECTORY_RUBRIC = """
Score 1-5 on process quality:
5 = Minimal necessary tools, safe, efficient, grounded at each step
3 = Reached goal but with redundant searches or one minor policy slip
1 = Unsafe tool use, runaway loop, or answer despite missing evidence
"""

Calibrate judges against humans the same way as LLM-as-judge — agreement on ~30–50 cases before gating CI.

3. Deterministic checks where possible. Required tool called? Max steps exceeded? Citation present when search ran? Cheap gates before expensive judges.

4. Pairwise on trajectories. When comparing harness v1 vs. v2, ask: which trace would you rather debug in production? — often clearer than absolute scores.

CI and regression

Eval-driven development applies: trajectory metrics belong in the same regression suite as outcome metrics. Typical gates:

Outcome win rate ≥ baseline (pairwise or absolute threshold)
Process score ≥ baseline
P95 steps and P95 cost ≤ baseline + slack

A release that improves final answers but doubles tool spam should fail the process gate even if users occasionally notice.

Relation to online eval

Production traces feed the eval dataset — missteps users report become trajectory cases with full spans. This closes the loop described in continuous learning: the frontier skill is curating step-level failures, not only thumbs-down on the last message.

→ Next: Efficient models & test-time compute

🤔 Quick checkQuick check

Outcome vs. process​

What to score on a trajectory​

How to implement without drowning in work​

CI and regression​

Relation to online eval​

Outcome vs. process

What to score on a trajectory

How to implement without drowning in work

CI and regression

Relation to online eval