Efficient models & test-time compute

In one line: Not every call needs the frontier model — routing, hybrid architectures, and bounded test-time compute (reasoning tokens) are how teams afford agents at scale.

In plain English

Test-time compute means spending extra work at answer time — more thinking tokens, more search steps, a second model pass — to get a harder question right. The engineering skill is budgeting that spend: cheap models for easy turns, reasoning models for gnarly ones, and hard caps so one user request cannot drain your margin.

The three-lever mental model

Lever	What you control	Typical use
Model tier	Haiku vs. Sonnet vs. Opus / nano vs. mini vs. flagship	Route by task difficulty
Test-time compute	Reasoning depth, self-consistency samples, verifier pass	Hard math, code, multi-step planning
Architecture class	Pure transformer vs. hybrid SSM+attention	Long context, throughput-sensitive serving

Current names and prices live on the model snapshot. The ratios (frontier ≈ 4–10× workhorse) tend to survive price cuts, so plan around the ratio rather than the absolute price.
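To make the ratio concrete, here is a rough sketch of the blended cost per call under routing. The 5× ratio and the 20% escalation share are illustrative assumptions, not measurements, and `blended_cost` is a hypothetical helper.

```python
def blended_cost(frontier_ratio: float, frontier_share: float) -> float:
    """Average cost per call, in units of one workhorse call.

    frontier_ratio: frontier price / workhorse price (assumed, e.g. 5.0)
    frontier_share: fraction of calls routed to the frontier tier (0..1)
    """
    if not 0.0 <= frontier_share <= 1.0:
        raise ValueError("frontier_share must be between 0 and 1")
    return (1.0 - frontier_share) * 1.0 + frontier_share * frontier_ratio


# Illustrative numbers only: frontier is 5x workhorse, 20% of traffic escalates.
routed = blended_cost(frontier_ratio=5.0, frontier_share=0.2)
all_frontier = blended_cost(frontier_ratio=5.0, frontier_share=1.0)
print(f"routed: {routed:.2f}x workhorse, all-frontier: {all_frontier:.2f}x workhorse")
```

Under these assumed numbers the saving comes entirely from the escalation share, so measure that share on your own traffic first.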

Model routing in production

The modern default is a router (rules, a classifier, or a small model) that picks a tier per request.

Common routing patterns (see also production patterns and decision frameworks):

Cascade: try the cheap tier first; escalate to the expensive tier only if confidence is low or an eval check fails on a sample.
Parallel verify: a cheap draft plus a small verifier model (for some tasks, cheaper than one frontier call).
Per-step routing in agents: a grep-style search step might go to a tiny model; architecture design might go to the reasoning tier.
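The cascade pattern can be sketched as follows. This is a minimal illustration, not a production router: the `Tier` type, the stub models, the relative costs, and the 0.8 confidence threshold are all hypothetical, and real confidence signals (a verifier score, log-probabilities, an eval check) would replace the stubbed values.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    cost: float  # relative cost per call (hypothetical units)
    call: Callable[[str], Tuple[str, float]]  # returns (answer, confidence in 0..1)


def cascade(prompt: str, tiers: List[Tier], min_confidence: float = 0.8) -> Tuple[str, str, float]:
    """Try tiers cheapest-first and stop at the first confident answer.

    Returns (answer, tier_name, total_cost). The most expensive tier's answer
    is accepted regardless of confidence, since there is nowhere left to escalate.
    """
    if not tiers:
        raise ValueError("need at least one tier")
    ordered = sorted(tiers, key=lambda t: t.cost)
    total_cost = 0.0
    for index, tier in enumerate(ordered):
        answer, confidence = tier.call(prompt)
        total_cost += tier.cost  # failed cheap attempts are still paid for
        is_last = index == len(ordered) - 1
        if confidence >= min_confidence or is_last:
            return answer, tier.name, total_cost


def cheap_model(prompt: str) -> Tuple[str, float]:
    # Stand-in for a small model: confident only on short prompts.
    confident = len(prompt.split()) <= 5
    return f"cheap answer to: {prompt}", 0.95 if confident else 0.4


def frontier_model(prompt: str) -> Tuple[str, float]:
    # Stand-in for a frontier model.
    return f"frontier answer to: {prompt}", 0.9


TIERS = [Tier("frontier", 5.0, frontier_model), Tier("workhorse", 1.0, cheap_model)]

easy = cascade("What is 2 + 2?", TIERS)
hard = cascade("Design a sharded, multi-region queue with exactly-once delivery", TIERS)
print(easy[1], easy[2])
print(hard[1], hard[2])
```

Note that an escalated request costs more than a direct frontier call, because the cheap attempt is paid for as well. A cascade only saves money when most requests stop at the cheap tier.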

Routing without evals is guesswork — measure win rate and cost per tier on your dataset.
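One way to get those numbers is to run the same eval set through each tier and summarize the results. The sketch below assumes you already have pass/fail grades and a per-call cost for each record; the record format and the sample data are made up for illustration.

```python
from collections import defaultdict


def summarize(results):
    """Summarize eval records per tier.

    results: iterable of (tier, passed, cost) tuples from an eval run.
    """
    stats = defaultdict(lambda: {"n": 0, "wins": 0, "cost": 0.0})
    for tier, passed, cost in results:
        entry = stats[tier]
        entry["n"] += 1
        entry["wins"] += int(passed)
        entry["cost"] += cost
    summary = {}
    for tier, entry in stats.items():
        summary[tier] = {
            "win_rate": entry["wins"] / entry["n"],
            "avg_cost": entry["cost"] / entry["n"],
            # Cost per passing answer; infinite if the tier never passes.
            "cost_per_win": entry["cost"] / entry["wins"] if entry["wins"] else float("inf"),
        }
    return summary


# Made-up eval records: (tier, passed, relative cost).
RESULTS = [
    ("workhorse", True, 1.0), ("workhorse", True, 1.0),
    ("workhorse", False, 1.0), ("workhorse", True, 1.0),
    ("frontier", True, 5.0), ("frontier", True, 5.0),
    ("frontier", True, 5.0), ("frontier", True, 5.0),
]

summary = summarize(RESULTS)
for tier, row in summary.items():
    print(tier, row)
```

Cost per passing answer is a convenient single number for comparing tiers, but it hides which tasks fail, so inspect the failures before moving a route to the cheaper tier.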

Test-time compute (reasoning budgets)

Reasoning models spend extra tokens thinking before answering. That is test-time compute: more inference work, better results on hard problems, higher bill and latency.

Harness responsibilities:

Cap thinking tokens per request and per agent loop iteration
Expose a user-visible tradeoff — fast vs. deep mode
Fall back when the budget is exhausted: return a partial answer or ask the user to narrow the question
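The three responsibilities above can be sketched as a small budget object wrapped around a simulated agent loop. Everything here is an assumption for illustration: the `ThinkingBudget` class, the cap values for the "fast" and "deep" modes, and the simulated token usage. In a real harness the granted allowance would be passed to whatever thinking-budget or max-token parameter your provider exposes.

```python
from dataclasses import dataclass


@dataclass
class ThinkingBudget:
    per_step_cap: int      # max thinking tokens for one loop iteration
    per_request_cap: int   # max thinking tokens across the whole request
    used: int = 0

    def grant(self) -> int:
        """Thinking tokens the next step may use (0 means exhausted)."""
        remaining = self.per_request_cap - self.used
        return max(0, min(self.per_step_cap, remaining))

    def record(self, tokens_used: int) -> None:
        self.used += tokens_used


def budget_for(mode: str) -> ThinkingBudget:
    # User-visible tradeoff; the cap values are hypothetical.
    caps = {"fast": (1_000, 2_000), "deep": (8_000, 30_000)}
    per_step, per_request = caps[mode]
    return ThinkingBudget(per_step, per_request)


def run_agent(steps_needed: int, tokens_per_step: int, mode: str) -> str:
    """Simulated loop: each step wants tokens_per_step thinking tokens."""
    budget = budget_for(mode)
    for step in range(steps_needed):
        allowance = budget.grant()
        if allowance == 0:
            # Fallback path: stop early instead of overspending.
            return f"partial answer after {step} steps; ask the user to narrow the question"
        budget.record(min(tokens_per_step, allowance))
    return f"complete answer after {steps_needed} steps"


print(run_agent(steps_needed=3, tokens_per_step=1_500, mode="fast"))
print(run_agent(steps_needed=3, tokens_per_step=1_500, mode="deep"))
```

Keeping the per-step and per-request caps separate means one runaway iteration cannot consume the whole request budget, and a long loop cannot exceed the request budget through many small steps.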

June 2026 snapshot

On frontier reasoning SKUs, thinking often adds 1K–30K tokens and 5–60 s of latency per answer (see the model snapshot). Treat the numbers as volatile; treat budgeting discipline as durable.

Hybrid and efficient architectures (concept level)

Research and open-weight stacks increasingly mix State Space Model (SSM) layers, such as Mamba-class layers, with transformer attention to handle long sequences with less memory. As an app engineer you rarely pick the architecture directly today; you are more likely to choose an inference provider or open model that advertises better long-context $/token or tokens/sec.

What to carry away:

Attention scales poorly with very long contexts; hybrids target cheaper long-range processing.
Serving economics (throughput on inference servers) matter as much as benchmark scores for high-volume products.
Diffusion language models and other non-autoregressive generators are an active research thread — interesting for latency-shaped workloads, not yet the default app stack.
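To make the first point concrete, the sketch below estimates how the attention key/value cache grows with context length during serving. The formula (keys plus values, per layer, per KV head, per token) is the standard back-of-envelope estimate; the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical configuration, not any specific model.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for round numbers.
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len, **CONFIG) / 2**30
    print(f"{seq_len:>7} tokens: {gib:.0f} GiB of KV cache per sequence")
```

The cache grows linearly with every token kept in context, whereas an SSM layer carries a fixed-size state regardless of sequence length. That difference is what hybrid stacks try to exploit.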

Do not rewrite your product around a paper; evaluate on your eval set when a new architecture ships as a hosted model.

Agents multiply cost

One user message can become ten model calls plus retrieval. Efficient inference is mandatory for agent products:

Cache stable prefixes (prompt caching)
Batch offline jobs (batch inference)
Route tool-planning to workhorse, final polish to frontier only when needed
Enforce trajectory efficiency metrics (for example, model calls and cost per completed task)
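A rough cost model for one agent trajectory shows how these levers combine. All numbers are hypothetical: the relative prices, the cached-token discount, and the token counts are placeholders, and the sketch simplifies caching by charging the first call on each tier at the normal input rate (some providers price cache writes differently).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    tier: str
    prefix_tokens: int   # stable system prompt and tool definitions
    fresh_tokens: int    # new input for this call
    output_tokens: int


# Hypothetical relative prices per 1K tokens; not real vendor pricing.
PRICE = {
    "workhorse": {"input": 1.0, "cached": 0.1, "output": 5.0},
    "frontier": {"input": 5.0, "cached": 0.5, "output": 25.0},
}


def call_cost(call: Call, prefix_cached: bool) -> float:
    price = PRICE[call.tier]
    prefix_rate = price["cached"] if prefix_cached else price["input"]
    tokens_cost = (call.prefix_tokens * prefix_rate
                   + call.fresh_tokens * price["input"]
                   + call.output_tokens * price["output"])
    return tokens_cost / 1000


def trajectory_cost(calls: List[Call], use_cache: bool) -> float:
    total = 0.0
    seen_tiers = set()
    for call in calls:
        # Simplification: the first call on a tier fills the cache, later calls read it.
        cached = use_cache and call.tier in seen_tiers
        total += call_cost(call, cached)
        seen_tiers.add(call.tier)
    return total


# One user message: nine workhorse planning calls, then one frontier polish call.
routed = [Call("workhorse", 4_000, 500, 200) for _ in range(9)]
routed.append(Call("frontier", 4_000, 500, 200))
all_frontier = [Call("frontier", 4_000, 500, 200) for _ in range(10)]

print(f"calls per task: {len(routed)}")
print(f"routed + cached:   {trajectory_cost(routed, use_cache=True):.1f}")
print(f"routed, no cache:  {trajectory_cost(routed, use_cache=False):.1f}")
print(f"all frontier:      {trajectory_cost(all_frontier, use_cache=False):.1f}")
```

Tracking calls per task and cost per task from logs like this gives you the trajectory metrics to alert on, and shows which lever (routing or caching) is doing the work for your workload.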

Duolingo Max is a case study in per-turn cost control — persona and quality without frontier pricing on every exchange.

→ Next: Research radar (June 2026)
