Part 14: Production Patterns
The patterns that turn an LLM demo into something you can charge for.
In one line: Every successful production LLM app is a recombination of the same dozen patterns — streaming, structured output, tool use, RAG, agents, evals, caching, cost control, embeddings, multimodal, safety, and graceful fallbacks. Master them once; reuse forever.
A demo proves the model can do the thing. A product proves you can do the thing for ten thousand users, every day, without lighting your budget on fire or leaking customer data. These patterns are what stand between the two. None of them are exotic — most are software-engineering hygiene applied to a stochastic, expensive, network-bound component.
The patterns
Core delivery patterns
- Streaming UX — Sub-second perceived latency for every user-facing call.
- Structured output everywhere — JSON / typed objects as the default output shape.
- Tool use done right — Tight tool sets, great descriptions, parallel execution.
- The RAG pattern in production — Hybrid search, reranking, citations, evals.
- The agent loop with guardrails — Iteration caps, observability per step, human handoff.
- Coding agents — Context strategies, diff application, test-loop verification — the patterns behind Cursor / Claude Code / Aider.
Operational patterns
- Evals as a product surface — LLM-as-judge, regression sets, prod sampling.
- Caching for cost & latency — Response cache, prompt cache, embedding cache.
- Cost control patterns — Tiered models, prompt trimming, rate limits, kill switches.
Adjacent patterns
- Embeddings & semantic search — Search, recs, dedup, classification — even without an LLM at query time.
- Multimodal patterns — Vision-first extraction, transcribe-then-process, the "give the model the image" trick.
- Safety & privacy — Prompt injection, PII scrubbing, authorization in RAG.
- Fallbacks & graceful degradation — Tiered fallback, cached response, non-AI path, "temporarily unavailable" UX.
- The LLM debugging playbook — Symptom → likely root causes → first three things to check. The reference card for when something's misbehaving.
- Safe model swaps & canary deploys — Shadow traffic, canary rollouts, A/B tests for prompt and model changes — the deployment discipline that prevents silent regressions.
- LLMOps — the production operations loop — The closed loop that ties every other pattern together: version, roll out, observe, evaluate in prod, and feed failures back. The "SRE for AI" synthesis.
Putting it together
- Complete worked example — Customer-support assistant with RAG + tools + escalation, with evals, deploy, and observability.
- Chapter checkpoint — Self-test.
The mental model
Treat an LLM call like a slow, expensive, non-deterministic, network-bound function.
That's not a put-down — it's a checklist. Everything that's true of slow, expensive, non-deterministic, network-bound functions is true of LLM calls, and the standard software response applies to each property:
| Property | Standard response |
|---|---|
| Slow | Stream output, run things in parallel, cache. |
| Expensive | Tier models, cap budgets, cache, summarize history. |
| Non-deterministic | Constrain with schemas, validate output, evaluate continuously. |
| Network-bound | Retry, time-out, queue, fall back. |
Every pattern in this chapter is one of those responses, dressed up. That's the whole game.
How to read this chapter
Each page is a self-contained pattern with:
- The shape — what the pattern looks like.
- Why it matters — what you lose without it.
- Worked example — real code in TypeScript or Python.
- Watch out for — gotchas that bite.
- 2026 stack — what library implements this today.
Use these pages as a design checklist when building a new feature, and as a debugging checklist when an existing feature underperforms.
Pages reference a single recurring worked example: a customer-support assistant that combines RAG, tool calls, evals, and human escalation. We build it incrementally, one pattern per layer, and the Complete worked example page glues the layers together.
→ Start with Streaming UX.