Part 14: Production Patterns

The patterns that turn an LLM demo into something you can charge for.

In one line: Every successful production LLM app is a recombination of the same dozen patterns — streaming, structured output, tool use, RAG, agents, evals, caching, cost control, embeddings, multimodal, safety, and graceful fallbacks. Master them once; reuse forever.

In plain English

A demo proves the model can do the thing. A product proves you can do the thing for ten thousand users, every day, without lighting your budget on fire or leaking customer data. These patterns are what stand between the two. None of them are exotic — most are software-engineering hygiene applied to a stochastic, expensive, network-bound component.

The patterns

Core delivery patterns

Operational patterns

Adjacent patterns

Putting it together

The mental model

Treat an LLM call like a slow, expensive, non-deterministic, network-bound function.

That's not a put-down — it's a checklist. Everything that's true of slow, expensive, non-deterministic, network-bound functions is true of LLM calls, and the standard software response applies to each property:

Property            Standard response
Slow                Stream output, run things in parallel, cache.
Expensive           Tier models, cap budgets, cache, summarize history.
Non-deterministic   Constrain with schemas, validate output, evaluate continuously.
Network-bound       Retry, time-out, queue, fall back.

Every pattern in this chapter is one of those responses, dressed up.
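To make the checklist concrete, here is a minimal sketch of a wrapper that applies three of those responses to a single call: a cache (slow, expensive), output validation (non-deterministic), and retries with backoff (network-bound). The names `call_with_guards`, `ModelFn`, and `required_keys` are hypothetical, as is the assumption that the model is asked to return a JSON object; a real client library would replace the plain `model` callable.

```python
import json
import time
from typing import Callable, Dict, Optional, Set

# Hypothetical signature: any function that takes a prompt and returns raw text.
ModelFn = Callable[[str], str]

# Slow + expensive: never pay twice for the same prompt (in-memory for the sketch).
_cache: Dict[str, dict] = {}


def call_with_guards(
    prompt: str,
    model: ModelFn,
    required_keys: Set[str],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> dict:
    """Wrap one model call with caching, output validation, and retries."""
    if prompt in _cache:
        return _cache[prompt]

    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            raw = model(prompt)  # Network-bound: may raise or time out.
            parsed = json.loads(raw)  # Non-deterministic: may not be valid JSON.
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            missing = required_keys - set(parsed)
            if missing:
                raise ValueError(f"missing keys: {sorted(missing)}")
            _cache[prompt] = parsed
            return parsed
        except (ValueError, TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt < max_attempts - 1:
                time.sleep(base_delay_s * 2**attempt)  # Exponential backoff.
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

The sketch retries on malformed output as well as on network errors, which is one possible design choice; a production version might instead route a validation failure to a different model or to a human.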

[Diagram: request flow] User request → Cheap path? (cache / small model). On a hit → Response. On a miss → Retrieve context (RAG / tools) → Generate (stream + structured output) → Validate (schema / citations / policy). If ok → Response. If bad → Fallback (retry / bigger model / human).

That's the whole game. Each pattern in this chapter fills in one of those boxes.
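The same flow can be sketched as one function with pluggable stages. This is an illustration under stated assumptions, not the chapter's implementation: `Stages`, `handle_request`, and every stage signature are hypothetical, and the sketch assumes the fallback's output is returned directly as the response.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Stages:
    """Hypothetical stage functions; each one stands in for a box in the flow."""

    cheap_path: Callable[[str], Optional[str]]   # cache / small model; None on a miss
    retrieve: Callable[[str], List[str]]         # RAG / tools
    generate: Callable[[str, List[str]], str]    # stream + structured output
    validate: Callable[[str, List[str]], bool]   # schema / citations / policy
    fallback: Callable[[str, List[str]], str]    # retry / bigger model / human


def handle_request(request: str, stages: Stages) -> str:
    """Route one request through the boxes in order."""
    cheap = stages.cheap_path(request)
    if cheap is not None:  # hit: skip the expensive path entirely
        return cheap

    context = stages.retrieve(request)  # miss: gather context first
    draft = stages.generate(request, context)

    if stages.validate(draft, context):  # ok
        return draft
    return stages.fallback(request, context)  # bad (assumed to produce the response)
```

Keeping each stage behind a plain callable means a pattern can be swapped in or stubbed out without touching the routing logic, which also makes each branch easy to test in isolation.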

How to read this chapter

Each page is a self-contained pattern with:

  1. The shape — what the pattern looks like.
  2. Why it matters — what you lose without it.
  3. Worked example — real code in TypeScript or Python.
  4. Watch out for — gotchas that bite.
  5. 2026 stack — what library implements this today.

Use them as a checklist when designing a new feature, and as a debugging checklist when an existing feature underperforms.

The through-line example

Pages reference a single recurring worked example: a customer-support assistant that combines RAG, tool calls, evals, and human escalation. We build it incrementally — each pattern is one layer — and the final page (complete example) glues them together.


→ Start with Streaming UX.