Model families
In one line: Models cluster into three tiers (frontier / workhorse / small), two licensing camps (closed API / open weights), and two thinking modes (chat / reasoning). Picking the right cell can save you roughly 10× on cost and 5× on latency.
There isn't "the LLM." There's a whole zoo. Frontier models are Ferraris: the most capable, the most expensive, used when nothing cheaper works. Workhorse models are Hondas: about 80% of the capability at 20% of the price, the right default. Small models are e-bikes: perfect for one specific quick trip. Closed models are SaaS; open models you can host yourself. Reasoning models think before they answer, at the cost of latency and dollars.
The three tiers
Every major provider ships the same three-tier shape. The names rotate every few months; the tiers don't. Current names and per-token prices live on the Model snapshot — this page teaches the durable shape.
Frontier
- Used for: hard reasoning, agent backbones, complex code generation, anything where you'd otherwise need a human expert.
- Examples: each provider's flagship — see the snapshot for current names.
- Price shape: roughly 4–10× the workhorse tier per token.
- Latency: 1–5 seconds time-to-first-token, often slower for reasoning models.
Workhorse
- Used for: the default for most user-facing features. Chat, summarization, classification, RAG synthesis, light coding.
- Examples: each provider's mid-size model, plus the large open-weight instruct models.
- Price shape: roughly 5–10× the small tier per token.
- Latency: 300ms–1s TTFT, 80–200 tokens/sec.
Small
- Used for: classification, extraction, routing, simple chat, heavy-volume background jobs. These models are often distilled from a bigger one and work best when pointed at one narrow job.
- Examples: each provider's mini / nano / flash models, plus small open-weight models.
- Price shape: the cheapest tier — often 50–100× cheaper than frontier per token.
- Latency: sub-200ms TTFT, 200–500+ tokens/sec on dedicated fast-inference infra (1000+ is possible).
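The per-tier price shapes compound. As a quick sanity check, here is a minimal sketch that multiplies the two ratios quoted above to bracket the frontier-versus-small gap. The constant and function names are illustrative, and the numbers are this page's rough shapes, not real price data.

```python
# Ratios quoted in the tier descriptions; rough shapes, not real prices.
FRONTIER_OVER_WORKHORSE = (4, 10)  # frontier costs roughly 4-10x workhorse
WORKHORSE_OVER_SMALL = (5, 10)     # workhorse costs roughly 5-10x small


def frontier_over_small():
    """Compound the two ranges: frontier cost relative to the small tier."""
    low = FRONTIER_OVER_WORKHORSE[0] * WORKHORSE_OVER_SMALL[0]
    high = FRONTIER_OVER_WORKHORSE[1] * WORKHORSE_OVER_SMALL[1]
    return low, high


print(frontier_over_small())  # (20, 100)
```

So a task that passes on the small tier can plausibly cost one to two orders of magnitude less than the same task on a frontier model.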
Closed vs open
- Closed (hosted only): the flagship models from OpenAI, Anthropic, and Google. You hit an API; you don't see the weights. Best raw quality and simplest ops, but you take on vendor lock-in and cannot run offline.
- Open weights (downloadable): Meta (Llama), Mistral, Alibaba (Qwen), DeepSeek, Cohere, Microsoft (Phi). You can host them yourself, fine-tune them, run them air-gapped.
The quality gap between top open and top closed models has narrowed to a lag of a few months on most benchmarks (see the snapshot for the current state). The lock-in gap has not.
| Need | Default |
|---|---|
| Top quality, you don't mind paying | Closed frontier |
| Data must not leave your VPC | Open, self-hosted |
| High volume, cheap-per-token | Open via managed inference |
| Compliance / customer demands offline | Open, self-hosted |
| Just shipping fast | Closed workhorse |
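The table can be read as a small decision function. The sketch below is one possible encoding; the `Needs` fields, the function name, and especially the priority order (hard data constraints first, then volume, then quality) are assumptions made for illustration, not rules stated by this page.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    """Hypothetical flags mirroring the rows of the table above."""
    data_must_stay_in_vpc: bool = False
    offline_required: bool = False
    high_volume: bool = False
    top_quality: bool = False


def default_choice(needs: Needs) -> str:
    # Hard constraints win: if data cannot leave, hosted APIs are out.
    if needs.data_must_stay_in_vpc or needs.offline_required:
        return "Open, self-hosted"
    if needs.high_volume:
        return "Open via managed inference"
    if needs.top_quality:
        return "Closed frontier"
    # No special constraint: just ship.
    return "Closed workhorse"


print(default_choice(Needs(high_volume=True)))  # Open via managed inference
```

Checking the hard constraints first reflects that compliance requirements usually cannot be traded away, while cost and quality can.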
Reasoning models vs base chat models
A second axis. Reasoning models spend "thinking" tokens internally before answering. They're better at multi-step math, code planning, and other problems that reward working step by step, at the cost of higher latency and a higher cost per visible answer token.
- Reasoning: OpenAI's o-series, Claude with extended thinking, Gemini's Deep Think mode, and open-weight reasoners like DeepSeek R1 (current names: snapshot).
- Base chat: the default mode of most workhorse and frontier models.
When to reach for a reasoning model:
- Multi-step math, formal logic, theorem-y proofs.
- Code involving non-trivial planning before writing.
- Hard agentic decomposition (planner role).
- Anything where you've watched a workhorse model bluff its way through.
When NOT to:
- Latency-sensitive chat (reasoning adds 5–60 seconds).
- High-volume classification (way overkill).
- Anything a workhorse + good prompt already passes.
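The two lists above collapse into a short check: any "when NOT to" condition vetoes reasoning, and otherwise reasoning is worth it only when the task needs multi-step planning. A minimal sketch, with a hypothetical function name and flag names:

```python
def choose_mode(
    *,
    latency_sensitive: bool,
    high_volume_classification: bool,
    workhorse_passes_evals: bool,
    needs_multistep_planning: bool,
) -> str:
    """Return "reasoning" or "chat" following the two lists above."""
    # Any "when NOT to" condition rules reasoning out.
    if latency_sensitive or high_volume_classification or workhorse_passes_evals:
        return "chat"
    # Otherwise reach for reasoning only when the task needs it.
    return "reasoning" if needs_multistep_planning else "chat"


print(choose_mode(
    latency_sensitive=False,
    high_volume_classification=False,
    workhorse_passes_evals=False,
    needs_multistep_planning=True,
))  # reasoning
```

In practice the flags would come from your own evals and latency budget, not from guesses.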
Worked example: picking a model for a real task
You're building a support-ticket router that reads incoming tickets and tags them with category and priority.
- Volume: 50K tickets/day.
- Latency tolerance: seconds.
- Quality requirement: >95% category accuracy.
Try, in order:
- Small model first. Any small-tier model with a structured-output schema. Cost: a few dollars a day at this volume. Run on an eval set of 200 labeled tickets.
- If accuracy is <95%: try a workhorse model. Cost: roughly 5–10× the small tier per day, in line with the price shapes above. Usually closes the gap.
- If still bad: consider fine-tuning the small model on your labeled tickets (best ROI), or only routing the hard cases to a workhorse with the small model as gatekeeper (cascade pattern).
- Frontier: almost never the right call for this. Save it for the 5% of tickets the workhorse refuses to tag.
The default is "cheapest tier that passes evals," not "most expensive that's available."
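"Cheapest tier that passes evals" can be written as a loop over candidates ordered by cost. The sketch below uses stub classifiers in place of real model calls; the function names, the stubs, and the four-ticket eval set are all hypothetical, and a real eval set would be the 200 labeled tickets mentioned above.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]          # (ticket text, expected category)
Classifier = Callable[[str], str]  # ticket text -> predicted category


def accuracy(classify: Classifier, eval_set: List[Example]) -> float:
    """Fraction of labeled tickets the classifier tags correctly."""
    correct = sum(1 for text, expected in eval_set if classify(text) == expected)
    return correct / len(eval_set)


def cheapest_passing(
    candidates: List[Tuple[str, Classifier]],
    eval_set: List[Example],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the first tier (cheapest first) whose accuracy exceeds threshold."""
    for name, classify in candidates:
        if accuracy(classify, eval_set) > threshold:
            return name
    return None  # nothing passed: consider fine-tuning or a cascade


# Hypothetical stand-ins for real model calls, for illustration only.
def small_stub(text: str) -> str:
    return "billing" if "invoice" in text else "other"


def workhorse_stub(text: str) -> str:
    if "invoice" in text:
        return "billing"
    return "bug" if "crash" in text else "other"


EVAL_SET = [
    ("invoice is wrong", "billing"),
    ("app crash on login", "bug"),
    ("how do I export data", "other"),
    ("crash after update", "bug"),
]

TIERS = [("small", small_stub), ("workhorse", workhorse_stub)]
print(cheapest_passing(TIERS, EVAL_SET))  # workhorse
```

Because the candidates are ordered cheapest first, the loop stops as soon as a tier clears the bar and never evaluates a more expensive one.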
What beginners get wrong
- Always picking frontier "to be safe." You'll burn 10× the budget for no quality win on 80% of your traffic.
- Always picking small "to save money." Some tasks genuinely need a workhorse; forcing a small model onto them hurts users, and the resulting churn costs more than the tokens you saved.
- Treating "open" as automatically cheaper. A self-hosted Llama on idle H100s is the most expensive model on Earth. Cheap requires utilization.
- Mixing reasoning models into latency-critical UX. Users will not wait 30 seconds for a chat bubble. Use reasoning for offline or "deep research" flows only.
- Pinning to a specific model version forever. Models deprecate. Build your code so the model name is one environment variable away from being swapped.
- Not running evals before switching. "The new model came out, let's switch" without an eval set is how regressions ship to prod.
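The last two points combine naturally: read the model name from configuration, and gate any switch on an eval comparison. A minimal sketch follows; the `LLM_MODEL` variable name, the placeholder default, and the `safe_to_switch` helper are assumptions for illustration, not an established convention.

```python
import os

# Hypothetical placeholder; real names live on the model snapshot.
DEFAULT_MODEL = "example-workhorse-v1"


def current_model(env=None) -> str:
    """Read the model name from the environment so a swap is a config change."""
    env = os.environ if env is None else env
    return env.get("LLM_MODEL", DEFAULT_MODEL)


def safe_to_switch(old_accuracy: float, new_accuracy: float,
                   tolerance: float = 0.0) -> bool:
    """Allow a switch only if the new model does not regress on the eval set."""
    return new_accuracy >= old_accuracy - tolerance


print(current_model({"LLM_MODEL": "example-small-v2"}))  # example-small-v2
print(safe_to_switch(0.96, 0.90))                        # False
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.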
This is the cascade pattern. Run a small model first. If its confidence (or a cheap check) says "I'm not sure," escalate to a workhorse. If the workhorse still struggles, escalate to a frontier model. Most traffic stays on the small model, while the few hard cases still get frontier-level quality. A 5–20× cost reduction is typical.
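The escalation flow just described can be sketched in a few lines. This assumes each model call returns a label plus a confidence score, which is a simplification: real APIs may not expose a usable confidence, in which case a cheap validation check would take its place. The stub models and the 0.8 threshold are hypothetical.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], Tuple[str, float]]  # text -> (label, confidence)


def cascade(text: str, tiers: List[Tuple[str, Model]],
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try tiers cheapest first; return (label, name of the tier that answered)."""
    if not tiers:
        raise ValueError("need at least one tier")
    for name, model in tiers[:-1]:
        label, confidence = model(text)
        if confidence >= min_confidence:
            return label, name
    # The last tier is the backstop: accept its answer regardless of confidence.
    name, model = tiers[-1]
    label, _ = model(text)
    return label, name


# Hypothetical stubs standing in for real model calls.
def small(text: str) -> Tuple[str, float]:
    return ("billing", 0.95) if "invoice" in text else ("other", 0.4)


def workhorse(text: str) -> Tuple[str, float]:
    return ("bug", 0.9) if "crash" in text else ("other", 0.5)


def frontier(text: str) -> Tuple[str, float]:
    return ("escalation", 0.99)


TIERS = [("small", small), ("workhorse", workhorse), ("frontier", frontier)]
print(cascade("invoice is wrong", TIERS))    # ('billing', 'small')
print(cascade("app crash on login", TIERS))  # ('bug', 'workhorse')
```

Returning the tier name alongside the label makes it easy to log what fraction of traffic each tier handles, which is the number that determines the actual savings.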