Deployment & Rollouts
In one line: Every AI feature ships behind a flag, rolls out by cohort, has a kill switch flippable in under a minute, and is treated as a change-managed release — not a casual push.
A button color change can ship to 100% of users instantly. A new AI feature absolutely cannot. The model is non-deterministic; the output can degrade with traffic shape; the cost can spike unexpectedly; a single bad output to a high-profile customer can become a tweet. Every AI deploy is treated more like a regulated release than a routine push.
Feature flags as the deployment primitive
Every AI feature ships behind a flag from day one. Use PostHog or Statsig (both have generous free tiers up to a few million events/month).
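// Server-side check in the style of posthog-node: flag key, the user's distinct ID, then group context.
// Assumes `user.id` and `tenant.id` exist and that a "tenant" group type is configured.
const enabled = await posthog.isFeatureEnabled(
  "ai-clause-extractor",
  user.id,
  { groups: { tenant: tenant.id } }
);
if (!enabled) return <FallbackUI />;
return <AIClauseExtractor />;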
Three properties every AI flag has:
- Per-tenant scope. Roll out to specific customer cohorts, not just percentages of users.
- Kill switch. A single toggle that flips to `false` for everyone in under a minute.
- Targeting rules. Internal-only, beta-customers, free-tier, paid-tier, etc.
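To make the three properties concrete, here is a minimal sketch of how a flag with all three might be evaluated. The `AiFlag` shape, its field names, and `isEnabled` are hypothetical illustrations, not the PostHog or Statsig schema; in practice the flag platform does this evaluation for you.

```typescript
// Hypothetical flag model for illustration only; not a vendor schema.
type Plan = "internal" | "beta" | "free" | "paid";

interface AiFlag {
  killSwitch: boolean;         // true => feature is off for everyone
  allowedTenants: Set<string>; // per-tenant scope
  allowedPlans: Set<Plan>;     // targeting rules
}

interface RequestContext {
  tenantId: string;
  plan: Plan;
}

function isEnabled(flag: AiFlag, ctx: RequestContext): boolean {
  // The kill switch wins over every other rule.
  if (flag.killSwitch) return false;
  // Explicitly listed tenants get the feature regardless of plan.
  if (flag.allowedTenants.has(ctx.tenantId)) return true;
  // Otherwise fall back to plan-based targeting.
  return flag.allowedPlans.has(ctx.plan);
}
```

The ordering is the design choice worth copying: the kill switch is checked first, so no targeting rule can accidentally keep the feature alive.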
The cohort rollout pattern
| Stage | Cohort | Watch for |
|---|---|---|
| 0 — Internal | Team only | Obvious bugs, broken UX |
| 1 — Friends-of-house | Select friendly customers (5–10 tenants) | Real-world failures, missing eval cases |
| 2 — Canary 5% | 5% of paid users, random | Quality score on prod traces, cost, p95 |
| 3 — 25% | 25% | Same metrics, with more signal |
| 4 — 50% | 50% | Cost trends start to be reliable |
| 5 — 100% | All users | Steady state monitoring |
Typical timeline: stages 0–3 take a week; 4–5 take another week. Faster on Tier-3 features, slower on Tier-0/1.
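Flag platforms handle percentage rollouts internally, but it helps to see why a user who is in the 5% canary stays in at 25% and 50%. The sketch below assumes a deterministic hash (FNV-1a here, chosen only for brevity) that maps each user to a bucket from 0 to 99. The function names and the stage-to-percent map are illustrative, not any vendor's implementation.

```typescript
// Percent of users exposed at each percentage-based stage from the table above.
// Stages 0 and 1 are explicit lists (team, friendly tenants), so they map to 0% here.
const STAGE_PERCENT: Record<number, number> = { 2: 5, 3: 25, 4: 50, 5: 100 };

// 32-bit FNV-1a hash; any stable hash would do.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

// Stable bucket in [0, 100). Salting with the flag key keeps cohorts
// independent across features.
function bucket(flagKey: string, userId: string): number {
  return fnv1a(`${flagKey}:${userId}`) % 100;
}

function inRollout(flagKey: string, userId: string, stage: number): boolean {
  const percent = STAGE_PERCENT[stage] ?? 0;
  return bucket(flagKey, userId) < percent;
}
```

Because the bucket never changes for a given user and flag, widening the percentage only adds users; nobody flips back and forth between the old and new experience as the rollout advances.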
The kill switch
Every AI feature must have a single toggle that disables it for all users in under a minute.
- Lives in the feature flag platform (PostHog / Statsig) — not in code, not in env vars.
- The on-call engineer has the link to it bookmarked.
- Test it monthly. Anyone on call who can't demonstrate flipping it doesn't go on call.
- When flipped: feature returns the graceful fallback UI ("temporarily unavailable" or non-AI path).
Triggers for flipping the kill switch:
- Cost spike beyond 5x normal/hour.
- Quality score drop beyond a defined threshold on the prod-trace LLM-as-judge.
- Provider incident in which both providers are down (the gateway already handles single-provider failover).
- Public-facing bad output (PR fire).
- Customer-reported regression that requires investigation before continuing.
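The first three triggers are measurable, so they can back an alert that tells the on-call engineer to flip the switch. Here is a minimal sketch, assuming a hypothetical `HealthSnapshot` assembled from your cost and eval dashboards; all field names are illustrative. The last two triggers (a public bad output, a customer-reported regression) remain human judgment calls and are deliberately left out.

```typescript
// Hypothetical hourly snapshot; field names are assumptions for illustration.
interface HealthSnapshot {
  costLastHourUsd: number;
  baselineCostPerHourUsd: number; // "normal" hourly cost
  judgeScore: number;             // prod-trace LLM-as-judge score
  minJudgeScore: number;          // team-defined quality threshold
  allProvidersDown: boolean;      // gateway failover has nothing left to fail over to
}

// Returns the list of tripped triggers; an empty list means "leave the feature on".
function killSwitchReasons(s: HealthSnapshot): string[] {
  const reasons: string[] = [];
  if (s.costLastHourUsd > 5 * s.baselineCostPerHourUsd) reasons.push("cost-spike");
  if (s.judgeScore < s.minJudgeScore) reasons.push("quality-drop");
  if (s.allProvidersDown) reasons.push("provider-outage");
  return reasons;
}
```

Returning reasons rather than a bare boolean means the alert can say why it fired, which shortens the decision when someone is looking at it at 2 a.m.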
Deploy windows
| Change type | Allowed window |
|---|---|
| Code (non-AI) | Mon–Thu 9–5, Fri until 2pm |
| Tier-3 AI feature | Same |
| Tier-2 prompt change | Mon–Thu |
| Tier-1 prompt change | Mon–Wed |
| Tier-0 prompt change | Tue or Wed only, with senior eng + PM on call |
| Hotfix (any) | Any time, with on-call engineer monitoring |
| Model version pin bump | Tue or Wed, with full eval suite re-run |
The point: the higher the stakes of a change, the more working hours should remain after it ships for engineers to respond.
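A table like this is easy to enforce as a CI or chat-bot check. The sketch below encodes it under two assumptions the table does not state: that the 9–5 hours also apply to the prompt and model rows, and that the Friday window opens at 9. The type names and `isDeployAllowed` are hypothetical.

```typescript
type Weekday = "Mon" | "Tue" | "Wed" | "Thu" | "Fri" | "Sat" | "Sun";
type ChangeType =
  | "code"
  | "tier3-feature"
  | "tier2-prompt"
  | "tier1-prompt"
  | "tier0-prompt"
  | "model-pin-bump"
  | "hotfix";

// Allowed days per change type, copied from the table above.
const ALLOWED_DAYS: Record<Exclude<ChangeType, "hotfix">, Weekday[]> = {
  "code": ["Mon", "Tue", "Wed", "Thu", "Fri"],
  "tier3-feature": ["Mon", "Tue", "Wed", "Thu", "Fri"],
  "tier2-prompt": ["Mon", "Tue", "Wed", "Thu"],
  "tier1-prompt": ["Mon", "Tue", "Wed"],
  "tier0-prompt": ["Tue", "Wed"],
  "model-pin-bump": ["Tue", "Wed"],
};

// `hour` is the local hour of day, 0-23, in the team's working time zone.
function isDeployAllowed(change: ChangeType, day: Weekday, hour: number): boolean {
  // Hotfixes are allowed any time (with an on-call engineer monitoring).
  if (change === "hotfix") return true;
  if (!ALLOWED_DAYS[change].includes(day)) return false;
  // Assumption: 9:00-17:00 for every row, closing at 14:00 on Fridays.
  const closingHour = day === "Fri" ? 14 : 17;
  return hour >= 9 && hour < closingHour;
}
```

The check covers only the calendar. The staffing conditions in the table (senior eng and PM on call, full eval suite re-run) still need their own gates.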
Per-tenant rollout for white-glove accounts
Top customers often get opted out of new features by default and opt in explicitly when they're ready. Pattern:
- A "stability tier" tag per tenant in PostHog.
- "Stable" tenants only get features at 100% rollout, after 2 weeks of steady-state.
- "Early access" tenants get features at stage 1 (friends-of-house).
- The default tier is "standard", which gets features from stage 3 onward.
This costs a tiny amount of complexity. It saves the bigger conversation of "we shipped a regression to your $200K/year contract on a Friday."
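The tier rules reduce to a small eligibility gate. Here is a sketch, assuming the rollout stage and the number of days spent at steady-state 100% are tracked somewhere you can read them; `RolloutState` and `tenantEligible` are hypothetical names.

```typescript
type StabilityTier = "early-access" | "standard" | "stable";

interface RolloutState {
  stage: 0 | 1 | 2 | 3 | 4 | 5; // stages from the cohort rollout table
  daysAtFullRollout: number;    // days of steady state since reaching stage 5
}

// Whether a tenant in the given tier may see the feature at all.
// Percentage bucketing within a stage still applies on top of this gate.
function tenantEligible(tier: StabilityTier, state: RolloutState): boolean {
  switch (tier) {
    case "early-access":
      return state.stage >= 1; // friends-of-house onward
    case "standard":
      return state.stage >= 3;
    case "stable":
      return state.stage === 5 && state.daysAtFullRollout >= 14;
  }
}
```

In a flag platform this would typically be expressed as a targeting rule on the tenant's stability-tier property rather than as application code.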
Provider failover via gateway
The gateway (Portkey, OpenRouter, LiteLLM) handles model-provider failover automatically. Verify monthly:
- Manually disable the primary key in the gateway dashboard.
- Watch traffic shift to the fallback.
- Verify p95 latency stays under SLO.
- Re-enable.
This is the test that prevents the worst kind of incident: "Anthropic is down for 4 hours and our app is down with them."
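The "verify p95 stays under SLO" step is easiest to trust when it is computed rather than eyeballed. Here is a minimal sketch using the nearest-rank percentile over latency samples collected during the drill; how you export those samples from your gateway or tracing tool is left as an assumption, and the function names are illustrative.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no latency samples collected");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// True if p95 latency during the failover window stayed within the SLO.
function drillPassed(latenciesMs: number[], sloMs: number): boolean {
  return percentile(latenciesMs, 95) <= sloMs;
}
```

Throwing on an empty sample set is intentional: a drill that recorded no traffic on the fallback has not proven anything and should not count as a pass.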
Migrations: app deploys vs prompt deploys
Two different rituals:
- App code deploys: Vercel handles this automatically on merge. Preview → production. Standard web-deploy rules.
- Prompt + model changes: Also deployed via code (because prompts live in code), but additionally cohort-rolled via feature flags. The prompt-change part of a PR may stay at 5% for a week while the code part of the same PR is at 100%.
This is why eval-gating in CI and feature-flag cohorts are both needed — they handle different risks at different layers.
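One way to picture the split: both prompt versions ship in the same deploy, and the flag decides which one a given request uses. The sketch below assumes prompts are keyed by version in code and that the caller supplies a stable 0–99 bucket for the user; the prompt text, `PromptFlag`, and `selectPrompt` are hypothetical.

```typescript
// Both versions live in code and ship at 100% with the app deploy.
// Prompt text here is placeholder content for illustration.
const PROMPTS = {
  v1: "Extract the clauses from the contract below.",
  v2: "Extract each clause from the contract below and label its type.",
} as const;

type PromptVersion = keyof typeof PROMPTS;

// Hypothetical flag payload controlling the prompt rollout.
interface PromptFlag {
  killSwitch: boolean;
  stable: PromptVersion;    // previous, known-good prompt
  candidate: PromptVersion; // prompt being rolled out
  candidatePercent: number; // 0-100, advanced stage by stage
}

// `bucket` is a stable per-user number in [0, 100).
function selectPrompt(flag: PromptFlag, bucket: number): PromptVersion {
  // Kill switch reverts everyone to the known-good prompt.
  if (flag.killSwitch) return flag.stable;
  return bucket < flag.candidatePercent ? flag.candidate : flag.stable;
}
```

With this shape, rolling back a prompt is a flag change rather than a redeploy, which is what makes a sub-minute revert plausible.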
A 30-person AI startup's prompt change passed all evals, deployed to canary, soaked 24 hours, and rolled to 25%. At hour 36 of the 25% rollout, a high-volume customer hit a specific input shape that produced loops the eval set didn't cover. Cost for that customer jumped 18x in two hours.
The on-call engineer saw the cost alert, opened PostHog, flipped the kill switch. Feature reverted to the previous prompt globally in 45 seconds. The customer never noticed; their dashboard never showed weirdness; they just got the slightly older but stable version of the feature.
Post-mortem: add an eval case for the input shape, fix the prompt, redeploy through the same gates. Total customer-visible impact: zero. Total team stress: low because the kill switch was a reflex.
Engineers raised on "push to prod whenever" sometimes resist cohort rollouts. The reframe: AI features aren't more dangerous because they're AI; they're more dangerous because their failure mode is fluent and confident. A buggy non-AI feature breaks visibly. A buggy AI feature lies smoothly.
Cohort rollouts buy you time to spot fluent failures before they reach everyone. Two weeks of staged rollout for a high-stakes feature is cheap insurance compared to a 50% churn moment.
Common mistakes
- No kill switch. Every quarter someone needs one urgently. Every quarter a team without one ships an incident they could have stopped in seconds.
- Manual cohort management. Engineers hand-editing user lists in admin panels. Use the flag platform's cohort tooling; never let cohort logic live in product code.
- No "stability tier" for top accounts. On a Friday afternoon, a rollout hits your biggest customer and things go sideways. They cared about reliability, not the new feature. Tier them differently from day one.
- Treating model upgrades as routine. Anthropic and OpenAI release new models frequently. Each is a Tier-1 change even if it looks better: different latency curves, different failure modes, different cost. Eval, cohort, roll out.
- Skipping the monthly failover drill. It works in theory until the day you need it. Drill quarterly minimum, monthly ideally.
What's next
→ Continue to Observability where we cover Langfuse/Braintrust + Datadog dashboards and the quality/cost/latency triple.