Efficient models & test-time compute

In one line: Not every call needs the frontier model — routing, hybrid architectures, and bounded test-time compute (reasoning tokens) are how teams afford agents at scale.

In plain English

Test-time compute means spending extra work at answer time — more thinking tokens, more search steps, a second model pass — to get a harder question right. The engineering skill is budgeting that spend: cheap models for easy turns, reasoning models for gnarly ones, and hard caps so one user request cannot drain your margin.

The three-lever mental model

| Lever | What you control | Typical use |
| --- | --- | --- |
| Model tier | Haiku vs. Sonnet vs. Opus / nano vs. mini vs. flagship | Route by task difficulty |
| Test-time compute | Reasoning depth, self-consistency samples, verifier pass | Hard math, code, multi-step planning |
| Architecture class | Pure transformer vs. hybrid SSM+attention | Long context, throughput-sensitive serving |

Current names and prices live on the model snapshot. The ratios (frontier ≈ 4–10× workhorse) tend to outlast individual price cuts.

Model routing in production

The modern default is a router (rules, classifier, or small model) that picks a tier per request:

[Diagram: Incoming request → Router → Small / cheap tier (simple FAQ), Workhorse tier (standard task), or Frontier or reasoning tier (hard reasoning) → Response]
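
A minimal sketch of the rules-based variant of such a router. The tier names, the keyword list, and the 40-word threshold are illustrative assumptions, not values from any provider; a real router would be tuned on labelled traffic.

```python
import re
from dataclasses import dataclass

# Tier names are placeholders, not real model identifiers.
SMALL, WORKHORSE, FRONTIER = "small", "workhorse", "frontier"

# Hypothetical keyword hints that suggest a hard reasoning task.
HARD_HINTS = frozenset({"prove", "refactor", "architecture", "plan", "debug"})


@dataclass
class Request:
    text: str
    needs_tools: bool = False


def route(req: Request) -> str:
    """Pick a tier from simple, inspectable rules."""
    words = set(re.findall(r"[a-z]+", req.text.lower()))
    if words & HARD_HINTS:
        return FRONTIER          # hard reasoning
    if req.needs_tools or len(req.text.split()) > 40:
        return WORKHORSE         # standard task
    return SMALL                 # simple FAQ


print(route(Request("What are your opening hours?")))
```

Rules are the cheapest router to audit; a classifier or small model can replace `route` later without changing its callers.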

Common routing patterns (see production patterns and decision frameworks):

  • Cascade — try cheap first; escalate to expensive only if confidence is low or eval fails on a sample.
  • Parallel verify — cheap draft + small verifier model (cheaper than one frontier call for some tasks).
  • Per-step routing in agents — grep might be a tiny model; architecture design might be reasoning tier.
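
The cascade pattern from the list above can be sketched as follows. This assumes each model call returns an answer plus a confidence score in [0, 1]; how you obtain that score (a verifier, log-probabilities, a grader) is up to you, and the stub models and the 0.8 threshold are purely hypothetical.

```python
from typing import Callable, Tuple

# A model call here is any function returning (answer, confidence in [0, 1]).
ModelFn = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, cheap: ModelFn, expensive: ModelFn,
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the cheap tier first; escalate only when confidence is low.

    Returns (answer, tier_used).
    """
    answer, confidence = cheap(prompt)
    if confidence >= min_confidence:
        return answer, "cheap"
    answer, _ = expensive(prompt)
    return answer, "expensive"


# Stub models standing in for real API calls.
def cheap_stub(prompt: str) -> Tuple[str, float]:
    # Pretend the small model is only confident on short prompts.
    confidence = 0.9 if len(prompt) < 20 else 0.4
    return f"cheap:{prompt}", confidence


def expensive_stub(prompt: str) -> Tuple[str, float]:
    return f"expensive:{prompt}", 0.95


print(cascade("hi", cheap_stub, expensive_stub))
```

Note that an escalated request pays for both calls, so the cascade only saves money when most traffic stops at the cheap tier.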

Routing without evals is guesswork — measure win rate and cost per tier on your dataset.
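
One way to get those two numbers out of an eval run, as a rough sketch. The `EvalRow` shape is an assumption: it presumes you already log which tier answered, whether your grader passed the answer, and the measured cost of the call.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class EvalRow(NamedTuple):
    tier: str        # which tier answered
    passed: bool     # did the answer pass your grader
    cost_usd: float  # measured cost of that call


def summarize(rows: Iterable[EvalRow]) -> Dict[str, Dict[str, float]]:
    """Win rate and mean cost per tier over an eval run."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row.tier].append(row)
    report = {}
    for tier, items in buckets.items():
        n = len(items)
        report[tier] = {
            "n": n,
            "win_rate": sum(r.passed for r in items) / n,
            "mean_cost_usd": sum(r.cost_usd for r in items) / n,
        }
    return report


# Made-up rows for illustration only.
rows = [
    EvalRow("small", True, 0.001),
    EvalRow("small", False, 0.001),
    EvalRow("frontier", True, 0.01),
    EvalRow("frontier", True, 0.03),
]
for tier, stats in summarize(rows).items():
    print(tier, stats)
```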

Test-time compute (reasoning budgets)

Reasoning models spend extra tokens thinking before answering. That is test-time compute: more inference work, better results on hard problems, higher bill and latency.
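
Self-consistency sampling is one of the simpler forms of test-time compute to illustrate: sample several answers and keep the majority. The sketch below uses a stub sampler fed from a fixed list in place of a real model call; `make_stub` and its answers are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List


def self_consistency(sample: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the most common one."""
    votes: List[str] = [sample() for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]


def make_stub(answers: Iterable[str]) -> Callable[[], str]:
    """Stub sampler that replays a fixed list instead of calling a model."""
    it = iter(answers)
    return lambda: next(it)


# Five samples cost roughly five times one sample; the majority vote
# is what that extra inference spend buys.
sampler = make_stub(["41", "42", "42", "43", "42"])
print(self_consistency(sampler, n_samples=5))  # prints 42
```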

Harness responsibilities:

  • Cap thinking tokens per request and per agent loop iteration
  • Expose a user-visible tradeoff — fast vs. deep mode
  • Fall back when budget exhausted — partial answer or ask to narrow the question

June 2026 snapshot

Thinking often adds 1K–30K tokens and 5–60 s of latency per answer on frontier reasoning SKUs (see model snapshot). Treat the numbers as volatile; treat budgeting discipline as durable.
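
The harness responsibilities listed in this section (caps per request and per iteration, a fast vs. deep mode, a fallback on exhaustion) can be sketched as below. The mode limits are placeholder numbers, and `step` is a hypothetical callable standing in for one agent iteration that accepts a thinking-token allowance; how that allowance is passed to a real model depends on your provider.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

FALLBACK = "Budget exhausted: partial answer only. Try narrowing the question."

# Hypothetical limits: (per_request, per_iteration) thinking tokens.
MODE_LIMITS = {"fast": (2_000, 1_000), "deep": (30_000, 8_000)}


@dataclass
class ThinkingBudget:
    """Tracks thinking tokens across one user request (one agent loop)."""
    per_request: int
    per_iteration: int
    spent: int = 0

    def grant(self) -> int:
        """Tokens the next iteration may spend; 0 means exhausted."""
        return max(0, min(self.per_iteration, self.per_request - self.spent))

    def record(self, used: int) -> None:
        self.spent += used


def budget_for(mode: str) -> ThinkingBudget:
    """Fresh budget per request, chosen by the user-visible mode."""
    per_request, per_iteration = MODE_LIMITS[mode]
    return ThinkingBudget(per_request, per_iteration)


# step(allowance) -> (tokens_used, answer or None if not done yet)
Step = Callable[[int], Tuple[int, Optional[str]]]


def run_loop(budget: ThinkingBudget, step: Step, max_iterations: int = 10) -> str:
    for _ in range(max_iterations):
        allowance = budget.grant()
        if allowance == 0:
            return FALLBACK
        used, answer = step(allowance)
        budget.record(min(used, allowance))  # never count past the cap
        if answer is not None:
            return answer
    return FALLBACK


# A step that burns its whole allowance and never finishes hits the fallback.
print(run_loop(budget_for("fast"), lambda allowance: (allowance, None)))
```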

Hybrid and efficient architectures (concept level)

Research and open-weight stacks increasingly mix State Space Models (SSM) — e.g. Mamba-class layers — with transformer attention for long sequences at lower memory. You rarely pick this as an app engineer today; you might choose an inference provider or open model advertising better long-context $/token or tokens/sec.

What to carry away:

  • Attention scales poorly with very long contexts; hybrids target cheaper long-range processing.
  • Serving economics (throughput on inference servers) matter as much as benchmark scores for high-volume products.
  • Diffusion language models and other non-autoregressive generators are an active research thread — interesting for latency-shaped workloads, not yet the default app stack.
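
A back-of-the-envelope illustration of the first point. Standard full self-attention compares every token with every other token, so the score matrix grows with the square of the sequence length. The sketch counts only that matrix for one head in one layer; it ignores kernel-level optimizations and the parts of a model whose cost grows linearly, and the 4,096-token baseline is an arbitrary choice.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in the seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


def relative_cost(seq_len: int, baseline: int = 4_096) -> float:
    """How many times more score entries than at the baseline length."""
    return attention_score_entries(seq_len) / attention_score_entries(baseline)


for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens: {relative_cost(n):>6.0f}x the baseline score matrix")
```

An 8× longer context means 64× the score entries, where a layer with linear cost in sequence length would need only 8× the work. That gap is what hybrid designs are aiming at.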

Do not rewrite your product around a paper; evaluate on your eval set when a new architecture ships as a hosted model.

Agents multiply cost

One user message can become ten model calls plus retrieval, so efficient inference is mandatory for agent products.

Duolingo Max is a case study in per-turn cost control — persona and quality without frontier pricing on every exchange.
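
To make the multiplication concrete, here is a small cost calculator. Every number in the usage example (token counts and per-million-token prices) is invented for illustration; substitute your own measurements and current prices.

```python
def cost_per_user_message(calls: int,
                          input_tokens_per_call: int,
                          output_tokens_per_call: int,
                          usd_per_m_input: float,
                          usd_per_m_output: float) -> float:
    """Total model cost in USD for one user message (retrieval not included)."""
    input_cost = calls * input_tokens_per_call * usd_per_m_input / 1_000_000
    output_cost = calls * output_tokens_per_call * usd_per_m_output / 1_000_000
    return input_cost + output_cost


# Hypothetical prices: $1 per million input tokens, $5 per million output tokens.
single = cost_per_user_message(1, 3_000, 500, 1.0, 5.0)
agent = cost_per_user_message(10, 3_000, 500, 1.0, 5.0)
print(f"single call: ${single:.4f}, ten-call agent turn: ${agent:.4f}")
```

Under these made-up numbers a single call costs about $0.0055 and the ten-call agent turn about $0.055, which is why per-step routing and thinking caps matter more for agents than for single-shot chat.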


→ Next: Research radar (June 2026)
