Efficient models & test-time compute
In one line: Not every call needs the frontier model — routing, hybrid architectures, and bounded test-time compute (reasoning tokens) are how teams afford agents at scale.
Test-time compute means spending extra work at answer time — more thinking tokens, more search steps, a second model pass — to get a harder question right. The engineering skill is budgeting that spend: cheap models for easy turns, reasoning models for gnarly ones, and hard caps so one user request cannot drain your margin.
The three-lever mental model
| Lever | What you control | Typical use |
|---|---|---|
| Model tier | Haiku vs. Sonnet vs. Opus / nano vs. mini vs. flagship | Route by task difficulty |
| Test-time compute | Reasoning depth, self-consistency samples, verifier pass | Hard math, code, multi-step planning |
| Architecture class | Pure transformer vs. hybrid SSM+attention | Long context, throughput-sensitive serving |
Current names and prices live on the model snapshot. The ratios (frontier ≈ 4–10× workhorse) survive every price cut.
Model routing in production
The modern default is a router (rules, classifier, or small model) that picks a tier per request:
Patterns from production patterns and decision frameworks:
- Cascade — try cheap first; escalate to expensive only if confidence is low or eval fails on a sample.
- Parallel verify — cheap draft + small verifier model (cheaper than one frontier call for some tasks).
- Per-step routing in agents — grep might be a tiny model; architecture design might be reasoning tier.
Routing without evals is guesswork — measure win rate and cost per tier on your dataset.
Test-time compute (reasoning budgets)
Reasoning models spend extra tokens thinking before answering. That is test-time compute: more inference work, better results on hard problems, higher bill and latency.
Harness responsibilities:
- Cap thinking tokens per request and per agent loop iteration
- Expose a user-visible tradeoff — fast vs. deep mode
- Fall back when budget exhausted — partial answer or ask to narrow the question
Thinking tokens often add 1K–30K tokens and 5–60s latency per answer on frontier reasoning SKUs — see model snapshot. Treat numbers as volatile; treat budgeting discipline as durable.
Hybrid and efficient architectures (concept level)
Research and open-weight stacks increasingly mix State Space Models (SSM) — e.g. Mamba-class layers — with transformer attention for long sequences at lower memory. You rarely pick this as an app engineer today; you might choose an inference provider or open model advertising better long-context $/token or tokens/sec.
What to carry away:
- Attention scales poorly with very long contexts; hybrids target cheaper long-range processing.
- Serving economics (throughput on inference servers) matter as much as benchmark scores for high-volume products.
- Diffusion language models and other non-autoregressive generators are an active research thread — interesting for latency-shaped workloads, not yet the default app stack.
Do not rewrite your product around a paper; evaluate on your eval set when a new architecture ships as a hosted model.
Agents multiply cost
One user message can become ten model calls plus retrieval. Efficient inference is mandatory for agent products:
- Cache stable prefixes (prompt caching)
- Batch offline jobs (batch inference)
- Route tool-planning to workhorse, final polish to frontier only when needed
- Enforce trajectory efficiency metrics
Duolingo Max is a case study in per-turn cost control — persona and quality without frontier pricing on every exchange.
→ Next: Research radar (June 2026)