Skip to main content

Model families

In one line: Models cluster into three tiers (frontier / workhorse / small), two licensing camps (closed API / open weights), and two thinking modes (chat / reasoning). Picking the right cell saves you 10× on cost and 5× on latency.

In plain English

There isn't "the LLM." There's a whole zoo. Frontier models are Ferraris — fastest, most expensive, used when nothing cheaper works. Workhorse models are Hondas — 80% of the speed at 20% of the price, the right default. Small models are e-bikes — perfect for one specific quick trip. Closed models are SaaS; open models you can host yourself. Reasoning models think before they answer, at the cost of latency and dollars.

The three tiers

Every major provider ships the same three-tier shape. The names rotate every few months; the tiers don't. Current names and per-token prices live on the Model snapshot — this page teaches the durable shape.

Frontier — top of leaderboards, most expensiveEach provider's flagshipReasoning variantsWorkhorse — ~80% quality, 10-20% priceEach provider's mid modelLarge open-weight modelsSmall — fast, cheap, focusedEach provider's mini/nano/flashSmall open-weight models

Frontier

  • Used for: hard reasoning, agent backbones, complex code generation, anything where you'd otherwise need a human expert.
  • Examples: each provider's flagship — see the snapshot for current names.
  • Price shape: roughly 4–10× the workhorse tier per token.
  • Latency: 1–5 seconds time-to-first-token, often slower for reasoning models.

Workhorse

  • Used for: the default for most user-facing features. Chat, summarization, classification, RAG synthesis, light coding.
  • Examples: each provider's mid-size model, plus the large open-weight instruct models.
  • Price shape: roughly 5–10× the small tier per token.
  • Latency: 300ms–1s TTFT, 80–200 tokens/sec.

Small

  • Used for: classification, extraction, routing, simple chat, heavy-volume background jobs. Distilled from a bigger model for one job.
  • Examples: each provider's mini / nano / flash models, plus small open-weight models.
  • Price shape: the cheapest tier — often 50–100× cheaper than frontier per token.
  • Latency: sub-200ms TTFT, 200–500+ tokens/sec on dedicated fast-inference infra (1000+ is possible).

Closed vs open

  • Closed (hosted only): OpenAI, Anthropic, Google. You hit an API; you don't see the weights. Best raw quality, simplest ops, but vendor lock-in and no offline.
  • Open weights (downloadable): Meta (Llama), Mistral, Alibaba (Qwen), DeepSeek, Cohere, Microsoft (Phi). You can host them yourself, fine-tune them, run them air-gapped.

The quality gap between top open and top closed has narrowed to a few months on most benchmarks (see the snapshot for the current state). The lock-in gap has not.

NeedDefault
Top quality, you don't mind payingClosed frontier
Data must not leave your VPCOpen, self-hosted
High volume, cheap-per-tokenOpen via managed inference
Compliance / customer demands offlineOpen, self-hosted
Just shipping fastClosed workhorse

Reasoning models vs base chat models

A second axis. Reasoning models spend "thinking" tokens internally before answering. They're better at multi-step math, code planning, and chain-of-thought problems — at the cost of higher latency and higher cost per visible answer token.

  • Reasoning: OpenAI's o-series, Claude with extended thinking, Gemini's Deep Think mode, and open-weight reasoners like DeepSeek R1 (current names: snapshot).
  • Base chat: the default mode of most workhorse and frontier models.
User questionReasoning model?Hidden 'thinking'tokens1K-30K tokens,billedVisible answeryesno

When to reach for a reasoning model:

  • Multi-step math, formal logic, theorem-y proofs.
  • Code involving non-trivial planning before writing.
  • Hard agentic decomposition (planner role).
  • Anything where you've watched a workhorse model bluff its way through.

When NOT to:

  • Latency-sensitive chat (reasoning adds 5–60 seconds).
  • High-volume classification (way overkill).
  • Anything a workhorse + good prompt already passes.

Worked example: picking a model for a real task

You're building a support-ticket router that reads incoming tickets and tags them with category and priority.

  • Volume: 50K tickets/day.
  • Latency tolerance: seconds.
  • Quality requirement: >95% category accuracy.

Try, in order:

  1. Small model first. Any small-tier model with a structured-output schema. Cost: a few dollars a day at this volume. Run on an eval set of 200 labeled tickets.
  2. If accuracy is <95%: try a workhorse model. Cost: roughly 10× the small tier per day. Usually closes the gap.
  3. If still bad: consider fine-tuning the small model on your labeled tickets (best ROI), or only routing the hard cases to a workhorse with the small model as gatekeeper (cascade pattern).
  4. Frontier: almost never the right call for this. Save it for the 5% of tickets the workhorse refuses to tag.

The default is "cheapest tier that passes evals," not "most expensive that's available."

What beginners get wrong

Common mistakes
  • Always picking frontier "to be safe." You'll burn 10× the budget for no quality win on 80% of your traffic.
  • Always picking small "to save money." Some tasks genuinely need a workhorse; using a small model on them hurts users and you'll churn anyway.
  • Treating "open" as automatically cheaper. A self-hosted Llama on idle H100s is the most expensive model on Earth. Cheap requires utilization.
  • Mixing reasoning models into latency-critical UX. Users will not wait 30 seconds for a chat bubble. Use reasoning for offline or "deep research" flows only.
  • Pinning to a specific model version forever. Models deprecate. Build your code so the model name is one environment variable away from being swapped.
  • Not running evals before switching. "The new model came out, let's switch" without an eval set is how regressions ship to prod.
Highlight: the cascade pattern, your single best cost lever

Run a small model first. If its confidence (or a cheap check) says "I'm not sure," escalate to a workhorse. If the workhorse still struggles, escalate to a frontier. Most traffic stays on the small model; quality matches the frontier on the few that matter. 5–20× cost reduction is typical.

🤔 Quick checkQuick check

→ Next: Messages: system, user, assistant