Part 13: Decision Frameworks

The recurring "should we…" debates, with decision rules instead of vibes.

In one line: Most AI engineering decisions are not novel; they're variations on twenty recurring questions. Internalize the rules and you'll spend debate time on the actually-hard parts.

In plain English

Every AI feature kicks off the same arguments. "Should this be an agent?" "Do we need RAG?" "Open or closed model?" "Build it ourselves or use LangChain?" In 2026, almost none of these are open questions — there are good default answers and a small number of conditions where the default flips. This chapter gives you the defaults, the flip conditions, and the language to talk about them.

Why this chapter exists

Most AI projects don't fail on the model or the prompt. They fail because someone picked the wrong shape for the system — a multi-agent setup when a chain would do, fine-tuning when prompting was fine, a self-hosted LLM when the API was cheaper, an "agent platform" when a raw SDK plus a function would have shipped in a week.

These are recurring decisions. They show up in the first week of a project, again at the first scale wall, and again every time the team turns over. The cost of getting them wrong is months of wasted engineering and, more often, a quietly cancelled launch.

This chapter is a compressed playbook. Each page is a single decision rule: when does it apply, what's the default, what's the override.

Jargon used throughout this chapter

Boring AI — choosing the most-deployed proven option (e.g., GPT-4.1, Claude Sonnet, OpenAI Embeddings) over the leaderboard winner. The AI version of "boring technology."
Reversibility ladder — the ranked cost of unwinding an AI choice. Prompt changes are cheap; self-host migrations are not.
Eval bar — the minimum measured quality (accuracy, helpfulness, refusal rate) a feature has to hit before shipping or before a model swap.
Gateway tax — the latency and cost of routing every LLM call through a model-routing service (Portkey, OpenRouter, LiteLLM).
Agent runtime — the long-running execution layer that owns an agent's loop, state, tool use, retries, and observability.
Strangler ramp — feature-flagged rollout from 1% → 100% of traffic, with shadow comparison and a kill switch. The only safe AI cutover.
Asymmetric upside/downside — a feature whose best plausible gain (in revenue, retention, time saved) far outweighs its worst plausible loss. AI features are often asymmetric.
Kill switch — a single config flag that disables an AI feature or routes it to a deterministic fallback. Every production AI feature must have one.
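To make two of these terms concrete, here is a minimal sketch of a kill switch combined with a strangler-ramp traffic split. It is an illustration under stated assumptions, not a prescribed implementation: `RolloutConfig`, `bucket`, `use_ai_path`, and `answer` are hypothetical names, and the shadow comparison that a full strangler ramp would also run is left out for brevity.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class RolloutConfig:
    """Hypothetical per-feature config; field names are illustrative."""
    enabled: bool = True        # the kill switch: False disables the AI path
    ramp_percent: float = 1.0   # share of traffic (0-100) routed to the AI path


def bucket(user_id: str) -> float:
    """Map a user ID to a stable value in [0, 100) so assignment is sticky."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return (int(digest[:8], 16) % 10_000) / 100.0


def use_ai_path(user_id: str, config: RolloutConfig) -> bool:
    """Kill switch first, then the ramp percentage."""
    if not config.enabled:
        return False
    return bucket(user_id) < config.ramp_percent


def answer(
    user_id: str,
    question: str,
    config: RolloutConfig,
    ai_fn: Callable[[str], str],
    fallback_fn: Callable[[str], str],
) -> str:
    """Route to the AI path or to the deterministic fallback."""
    if use_ai_path(user_id, config):
        try:
            return ai_fn(question)
        except Exception:
            # Any failure on the AI path degrades to the fallback.
            return fallback_fn(question)
    return fallback_fn(question)
```

Raising `ramp_percent` from 1 toward 100 widens the rollout without reshuffling users already on the AI path, and setting `enabled` to `False` sends everyone to the fallback in a single config change.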

Highlight: the single most important principle

Boring AI beats exciting AI in almost every situation that matters. The model that topped the LMArena leaderboard last week is exciting; GPT-4.1 and Claude Sonnet are boring. Boring means: documented failure modes, ecosystem-wide observability, predictable cost, a stable SDK that won't break in three months. Boring is what your future self maintaining the system actually wants.

When two AI options look roughly equal, pick the one with more production deployment hours behind it, not the one with a higher benchmark number.
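The tie-break above can be sketched as a small selection function. This is a hedged illustration, not a method the chapter prescribes: `Candidate`, `pick_boring`, the `deployment_months` field (a stand-in proxy for production track record), and the `tie_margin` threshold for "roughly equal" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    eval_score: float        # measured on your own eval set, 0.0 to 1.0
    deployment_months: int   # hypothetical proxy for production track record


def pick_boring(
    candidates: List[Candidate],
    eval_bar: float,
    tie_margin: float = 0.02,  # assumed definition of "roughly equal"
) -> Optional[Candidate]:
    """Among candidates that clear the eval bar and score within tie_margin
    of the best, return the one with the longest track record."""
    passing = [c for c in candidates if c.eval_score >= eval_bar]
    if not passing:
        return None  # nothing clears the bar, so nothing ships
    best = max(c.eval_score for c in passing)
    roughly_equal = [c for c in passing if best - c.eval_score <= tie_margin]
    return max(roughly_equal, key=lambda c: c.deployment_months)
```

A newer model still wins when its eval score beats the established option by more than `tie_margin`; the rule only decides between options that are close.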

How to read this chapter

Each page is short and prescriptive. The structure is identical:

One-line rule. What you'd put on a sticky note.
Plain-English version. What it means without jargon.
The default. What 80% of teams should do.
When it doesn't apply. The 20% override cases — with the evidence you'd need.
Real-world examples. What this looks like in production code and team decisions.
Next. A pointer to the next rule.

Use the pages as a check before you commit to an architecture: did I actually justify this against the rule, or did I default to the trendy answer?

The rules, in the order they bite

Foundational mindset

Decisions overview — this page.
Pick boring models — the most-deployed model that passes evals.
The reversibility ladder — optimize for the cheap-to-undo rungs.
Team-size heuristic — what AI tooling each team size can support.

The classic architecture forks

Prompt vs RAG vs fine-tune — try in this exact order.
Agent vs chain vs multi-agent — chain by default.
Closed vs open-weight model — when each wins.
Build vs buy — the buy-leaning defaults for every layer.
When not to use AI — the question that saves quarters.

Engineering investment

Eval investment — what fraction of eng time to spend on evals.
Cost of inaction — the cost of NOT shipping an AI feature.
When to rebuild — signals for a full rebuild vs incremental.
Single vs multi-provider — when the gateway tax is worth it.
Sync vs async — streaming chat vs background workflow.
On-prem vs cloud — when self-hosted earns its cost.
Framework vs raw SDK — build a raw v0 first.
Prompt engineering vs fine-tuning — the explicit escalation rule.
Fine-tuning: the decision walkthrough — once you've decided to fine-tune, SFT vs DPO vs RFT with worked numbers.

Risk, planning, and people

What would hurt — the worst-plausible-failure pre-mortem.
When to buy an agent platform — build vs Cognition / Crew / Sierra.
Hiring constraint — what "AI engineer" actually means in 2026.

Putting it together

The 1-page checklist — model, RAG, agent, eval bar, kill switch, cost cap, owner.
When to override these rules — when intuition beats process.

Checkpoint

Chapter 13 checkpoint — a short self-test.

→ Start with Pick boring models.
