Skip to main content

Deployment & Rollouts

In one line: Every AI feature ships behind a flag, rolls out by cohort, has a kill switch flippable in under a minute, and is treated as a change-managed release — not a casual push.

In plain English

A button color change can ship to 100% of users instantly. A new AI feature absolutely cannot. The model is non-deterministic; the output can degrade with traffic shape; the cost can spike unexpectedly; a single bad output to a high-profile customer can become a tweet. Every AI deploy is treated more like a regulated release than a routine push.

Feature flags as the deployment primitive

Every AI feature ships behind a flag from day one. Use PostHog or Statsig (both have generous free tiers up to a few million events/month).

const enabled = await posthog.isFeatureEnabled(
"ai-clause-extractor",
{ user, tenant }
);
if (!enabled) return <FallbackUI />;
return <AIClauseExtractor />;

Three properties every AI flag has:

  1. Per-tenant scope. Roll out to specific customer cohorts, not just percentages of users.
  2. Kill switch. A single toggle that flips to false for everyone in under a minute.
  3. Targeting rules. Internal-only, beta-customers, free-tier, paid-tier, etc.

The cohort rollout pattern

StageCohortWatch for
0 — InternalTeam onlyObvious bugs, broken UX
1 — Friends-of-houseSelect friendly customers (5–10 tenants)Real-world failures, missing eval cases
2 — Canary 5%5% of paid users, randomQuality score on prod traces, cost, p95
3 — 25%25%Same metrics, with more signal
4 — 50%50%Cost trends start to be reliable
5 — 100%All usersSteady state monitoring

Typical timeline: stages 0–3 take a week; 4–5 take another week. Faster on Tier-3 features, slower on Tier-0/1.

The kill switch

Every AI feature must have a single toggle that disables it for all users in under a minute.

  • Lives in the feature flag platform (PostHog / Statsig) — not in code, not in env vars.
  • The on-call engineer has the link to it bookmarked.
  • Test it monthly. Anyone on call who can't demonstrate flipping it doesn't go on call.
  • When flipped: feature returns the graceful fallback UI ("temporarily unavailable" or non-AI path).

Triggers for flipping the kill switch:

  • Cost spike beyond 5x normal/hour.
  • Quality score drop beyond a defined threshold on the prod-trace LLM-as-judge.
  • Provider incident (gateway already does failover, but if both providers are down).
  • Public-facing bad output (PR fire).
  • Customer-reported regression that requires investigation before continuing.

Deploy windows

Change typeAllowed window
Code (non-AI)Mon–Thu 9–5, Fri until 2pm
Tier-3 AI featureSame
Tier-2 prompt changeMon–Thu
Tier-1 prompt changeMon–Wed
Tier-0 prompt changeTue or Wed only, with senior eng + PM on call
Hotfix (any)Any time, with on-call engineer monitoring
Model version pin bumpTue or Wed, with full eval suite re-run

The point: more important changes get more daylight engineering hours for response.

Per-tenant rollout for white-glove accounts

Top customers often get opted out of new features by default and opt in explicitly when they're ready. Pattern:

  • A "stability tier" tag per tenant in PostHog.
  • "Stable" tenants only get features at 100% rollout, after 2 weeks of steady-state.
  • "Early access" tenants get features at stage 1 (friends-of-house).
  • The default tenant is "standard" — gets features at stage 3 onward.

This costs a tiny amount of complexity. It saves the bigger conversation of "we shipped a regression to your $200K/year contract on a Friday."

Provider failover via gateway

The gateway (Portkey, OpenRouter, LiteLLM) handles model-provider failover automatically. Verify monthly:

  • Manually disable the primary key in the gateway dashboard.
  • Watch traffic shift to the fallback.
  • Verify p95 latency stays under SLO.
  • Re-enable.

This is the test that prevents the worst kind of incident: "Anthropic is down for 4 hours and our app is down with them."

Migrations: app deploys vs prompt deploys

Two different rituals:

  • App code deploys: Vercel handles this automatically on merge. Preview → production. Standard web-deploy rules.
  • Prompt + model changes: Also deployed via code (because prompts live in code), but additionally cohort-rolled via feature flags. The prompt-change part of a PR may stay at 5% for a week while the code part of the same PR is at 100%.

This is why eval-gating in CI and feature-flag cohorts are both needed — they handle different risks at different layers.

Worked example: kill switch saved an account

A 30-person AI startup's prompt change passed all evals, deployed to canary, soaked 24 hours, and rolled to 25%. At hour 36 of the 25% rollout, a high-volume customer hit a specific input shape that produced loops the eval set didn't cover. Cost for that customer jumped 18x in two hours.

The on-call engineer saw the cost alert, opened PostHog, flipped the kill switch. Feature reverted to the previous prompt globally in 45 seconds. The customer never noticed; their dashboard never showed weirdness; they just got the slightly older but stable version of the feature.

Post-mortem: add an eval case for the input shape, fix the prompt, redeploy through the same gates. Total customer-visible impact: zero. Total team stress: low because the kill switch was a reflex.

Highlight: AI is in the change-management process

Engineers raised on "push to prod whenever" sometimes resist cohort rollouts. The reframe: AI features aren't more dangerous because they're AI; they're more dangerous because their failure mode is fluent and confident. A buggy non-AI feature breaks visibly. A buggy AI feature lies smoothly.

Cohort rollouts buy you time to spot fluent failures before they reach everyone. Two weeks of staged rollout for a high-stakes feature is cheap insurance compared to a 50% churn moment.

Common mistakes

Where people commonly trip up
  • No kill switch. Every quarter someone needs one urgently. Every quarter a team without one ships an incident they could have stopped in seconds.
  • Manual cohort management. Engineers hand-editing user lists in admin panels. Use the flag platform's cohort tooling; never let cohort logic live in product code.
  • No "stability tier" for top accounts. Friday afternoon, a roll-out hits your biggest customer, things go sideways. They cared about reliability, not the new feature. Tier them differently from day one.
  • Treating model upgrades as routine. Anthropic / OpenAI release new models monthly. Each is a Tier-1 change even if it looks better — different latency curves, different failure modes, different cost. Eval, cohort, roll out.
  • Skipping the monthly failover drill. It works in theory until the day you need it. Drill quarterly minimum, monthly ideally.
🤔 Quick checkQuick check

What's next

→ Continue to Observability where we cover Langfuse/Braintrust + Datadog dashboards and the quality/cost/latency triple.