Agent harness engineering

In one line: An agent is a model plus a harness — the code that decides what it sees, what tools it gets, when to stop, and what to remember — and in production the harness often matters more than swapping the base model.

In plain English

Think of the model as the engine and the harness as everything else in the car: steering, brakes, fuel gauge, GPS. Two teams can use the same frontier model; the one with a better harness ships a reliable product while the other chases flaky demos. This page names what belongs in that layer and links back to foundations you already have.

Model vs. harness

Layer	Owns	Changes when
Model	Language, reasoning, tool-call formatting	Provider releases a new version
Harness	Loop control, context assembly, tool allowlists, memory, retries, observability	You ship features and fix failures

The agent loop lives in the harness. So do MCP connections, planning and reflection prompts, and the caps that stop runaway loops.

What a production harness must decide

1. Context assembly — What goes in the window this turn? System instructions, retrieved docs, tool results, compressed history. See context windows and memory. The harness is where context engineering happens: not bigger windows alone, but curating what fills them.

2. Tool routing — Which tools exist, in what order, with what permissions? A coding agent might expose read_file, grep, and run_tests — but not rm -rf. Function calling defines the interface; the harness defines the policy.

3. Memory tiers — Short-term (this session's messages), working (summaries and scratchpads), long-term (user prefs, past tasks). Not everything belongs in the prompt every turn. The harness writes and reads memory; the model only sees what the harness injects.

4. Budgets and stop rules — Max iterations, max tool calls, max dollars, max wall-clock time. When the budget hits zero, the harness must degrade gracefully — partial answer, ask the user, or hand off to a human — not spin forever.

5. Observability — Every tool call and model turn is a span in a trace. Without this you cannot debug agent failures or build trajectory evals.

Why harness work compounds

Upgrading from Sonnet to Opus might lift success rate a few points on hard tasks. Fixing the harness — better retrieval injection, tighter tool schemas, a retry when JSON is malformed — often moves reliability more than raw IQ. Case studies like Claude Code and Cursor differ less in which frontier model they call than in how context and tools are orchestrated.

Link to the core guide

If agents are new to you, read What is an agent loop? and Agent frameworks first. This page assumes that baseline and focuses on what teams optimize in 2026 production harnesses.

Common harness mistakes

Throwing every tool at the model — tool choice error rate climbs with catalog size; start minimal, add tools when evals prove need.
Unbounded history — stuffing full chat into context until quality collapses; summarize or retrieve instead.
No exit strategy — agents that loop until the user closes the tab; always define max steps and a fallback message.
Evaluating only the final message — the harness can fail on step 3 while step 10 looks fine; see trajectory evals.

Practical checklist

Before calling an agent production-ready, the harness should answer yes to:

Tool allowlist matches least privilege for the task
Context builder has a documented recipe (what gets injected when)
Memory writes are explicit (nothing implicit in model prose)
Budgets enforced in code, not only in the system prompt
Full traces exported to your observability stack

→ Next: Agentic RAG & memory

🤔 Quick checkQuick check

Model vs. harness​

What a production harness must decide​

Why harness work compounds​

Common harness mistakes​

Practical checklist​

Model vs. harness

What a production harness must decide

Why harness work compounds

Common harness mistakes

Practical checklist