Ops — observability, on-call, and kill switches

In one line: A solo dev is both the on-call and the kill switch (a single on/off flag); a startup runs PagerDuty + Langfuse + a single big red disable_ai flag; an enterprise runs follow-the-sun on-call, an LLM-aware SIEM, severity-tiered runbooks, and per-feature kill switches wired into the gateway.

In plain English

AI ops is regular ops plus three new failure modes: the model gets dumber on a Tuesday (drift), the bill suddenly 10x's (a runaway agent loop), and the model says something that ends up screenshotted on Twitter (a quality/safety regression).

Solo handles all three with "I check the dashboard sometimes." Startup wires alerts for each one. Enterprise has a dedicated person watching each one in shifts. The list of failure modes is identical at every scale — only the response time and accountability differ.

Observability stack

| Aspect | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| LLM call traces | Langfuse free tier | Langfuse Pro / Helicone / Arize | Datadog LLM + Langfuse + corporate SIEM |
| Cost dashboard | Provider dashboard | Per-tenant cost dashboard | FinOps tooling + per-team chargeback |
| Eval results in prod | None (eval is pre-merge) | Nightly drift eval, alert on regression | Continuous shadow evals on live traffic |
| Latency monitoring | Vercel built-in | Datadog / Better Stack | Datadog APM + custom SLO dashboards |
| Prompt + completion logging | Langfuse (no PII filter) | Langfuse with PII redaction | Gateway-level redaction → SIEM with retention policy |
| User feedback signal | Thumbs in the UI | Thumbs + structured outcome tracking | Thumbs + outcome + human-rater audit pipeline |
| Synthetic monitoring | None | Daily canary prompts | Multi-region synthetic prompts every minute |
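
The "Langfuse with PII redaction" cell comes down to scrubbing text before it reaches the logging tool. A minimal sketch, assuming two illustrative regex patterns and a hypothetical `build_log_record` shape (neither is Langfuse's API, and real PII detection needs a much broader rule set):

```python
import re

# Illustrative patterns only: they catch common email and phone formats,
# not every form of PII.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def build_log_record(prompt: str, completion: str) -> dict:
    """Hypothetical record shape; redaction runs before anything is stored."""
    return {"prompt": redact_pii(prompt), "completion": redact_pii(completion)}
```

Redacting at the point where the record is built means the raw text never reaches the logging backend, which is the same idea the enterprise column pushes down into the gateway.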

On-call

| Aspect | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| Who's paged | Your phone, sometimes | Engineer rotation (Slack + PagerDuty) | Follow-the-sun rotation across regions |
| Coverage | Best effort | Business hours + a "best-effort" overnight | 24/7 with documented SLA |
| Runbook quality | "I'll remember" | A Notion page per known failure | Severity-tiered runbooks per service, drilled quarterly |
| First responder skill | You | Generalist on-call engineer | L1 ops → escalate to L2 AI engineer → L3 platform |
| Time-to-acknowledge target | None | 15 minutes business hours | 5 minutes SEV1, 15 minutes SEV2 |
| Post-mortem | None | For anything user-visible | Formal blameless for SEV1/SEV2 with action-item tracking |
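
The time-to-acknowledge targets are easy to check mechanically. A small sketch using the enterprise-column numbers; the function name and the decision to ignore severities without a stated target are assumptions, not part of any paging tool:

```python
from datetime import datetime, timedelta

# Enterprise-column targets from the table. SEV3 has no stated target,
# so it is deliberately left out.
ACK_TARGETS = {"SEV1": timedelta(minutes=5), "SEV2": timedelta(minutes=15)}


def ack_breached(severity, paged_at, acked_at, now):
    """Return True if the page is past its acknowledge target.

    acked_at is None while nobody has acknowledged the page yet,
    in which case the elapsed time is measured up to `now`.
    """
    target = ACK_TARGETS.get(severity)
    if target is None:
        return False
    reference = acked_at if acked_at is not None else now
    return reference - paged_at > target
```

A scheduler could call this every minute for open pages and escalate to the next tier when it returns True.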

Kill-switch process

The most distinctly AI piece of ops. Models misbehave in ways traditional services don't, and you need a way to turn the feature off without rolling back code.

| Aspect | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| Granularity | Boolean: AI on/off | Per-feature flag | Per-feature × per-tenant × per-model |
| Mechanism | Env var flip + redeploy | Feature flag in Statsig/PostHog (instant) | Gateway-level routing rule (instant, audited) |
| Who can flip it | You | On-call engineer | On-call engineer + auto-flip on SLO breach |
| Time to disable in prod | 60 seconds + redeploy | < 30 seconds, no deploy | < 10 seconds, fully audited |
| What flipping does | Returns a "service unavailable" message | Falls back to non-AI flow or canned response | Falls back to non-AI flow + emits compliance event + auto-files incident |
| Tested how often | Never | Once when it was built | Quarterly fire drill |

What happens when a quality alert fires:

  • Solo: notice tomorrow, flip env var, redeploy.
  • Startup: PagerDuty pages on-call → flip flag in Statsig (30 sec).
  • Enterprise: alert auto-routes, kill switch flips, on-call confirms, incident channel opens (10 sec).
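
The per-feature × per-tenant × per-model granularity can be sketched as a small rule table with wildcards. This is an in-memory stand-in under stated assumptions: `KillSwitches`, `respond`, and the `"*"` wildcard convention are hypothetical names, not the API of Statsig, PostHog, or any gateway.

```python
from dataclasses import dataclass, field

WILDCARD = "*"


@dataclass
class KillSwitches:
    """In-memory stand-in for a flag service or gateway rule table."""
    rules: set = field(default_factory=set)

    def disable(self, feature=WILDCARD, tenant=WILDCARD, model=WILDCARD):
        # Calling disable() with no arguments is the global big red button.
        self.rules.add((feature, tenant, model))

    def enable(self, feature=WILDCARD, tenant=WILDCARD, model=WILDCARD):
        self.rules.discard((feature, tenant, model))

    def is_disabled(self, feature, tenant, model):
        # A rule matches when every field is either equal or a wildcard.
        return any(
            f in (WILDCARD, feature)
            and t in (WILDCARD, tenant)
            and m in (WILDCARD, model)
            for f, t, m in self.rules
        )


def respond(switches, feature, tenant, model, call_model, fallback):
    """Route to the non-AI fallback when any matching rule is active."""
    if switches.is_disabled(feature, tenant, model):
        return fallback()
    return call_model()
```

The point of the wildcard design is that the solo boolean, the startup per-feature flag, and the enterprise cohort rule are all the same lookup with different amounts of detail filled in.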

Incident severity

| Severity | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| SEV1 (worst) | "Everything's down" | Outage, data leak, or hallucination on the front page | Same + breach affecting >X users, regulator-notifiable, or revenue-impacting |
| SEV2 | Major feature broken | Major feature broken or quality dropped >20% | Same + customer-visible SLA risk |
| SEV3 | A test failed | Minor regression, eval drift > threshold | Same + per-tenant impact |
| Who declares | You | On-call engineer | On-call engineer; escalates to incident commander |
| Comms | None | Status page + internal Slack | Status page + customer emails + executive brief + (sometimes) regulatory filing |
| Required artifact | Maybe a tweet | Post-mortem in Notion | Formal incident report, root-cause doc, action items tracked to closure |
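
The startup column reads naturally as a first-match classifier. A sketch under assumptions: the `Signal` fields are hypothetical names for the conditions in the table, and `DRIFT_THRESHOLD` is a placeholder because the table only says "threshold".

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder value: the table does not say what the drift threshold is.
DRIFT_THRESHOLD = 0.05


@dataclass
class Signal:
    outage: bool = False
    data_leak: bool = False
    public_hallucination: bool = False
    major_feature_broken: bool = False
    quality_drop_pct: float = 0.0
    minor_regression: bool = False
    eval_drift: float = 0.0


def classify(signal: Signal) -> Optional[str]:
    """Return the highest severity whose startup-column condition matches."""
    if signal.outage or signal.data_leak or signal.public_hallucination:
        return "SEV1"
    if signal.major_feature_broken or signal.quality_drop_pct > 20:
        return "SEV2"
    if signal.minor_regression or signal.eval_drift > DRIFT_THRESHOLD:
        return "SEV3"
    return None
```

Checking the worst case first keeps the result stable when an incident trips several conditions at once.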

Highlight: the AI-specific failure modes traditional ops misses

A traditional ops team watches uptime, latency, and error rate. An AI ops team adds:

  1. Quality drift — the model still returns 200s, but the answers got worse. Caught only by continuous evals on live traffic, not by uptime monitoring.
  2. Cost blowup — a runaway agent loop or a single user with a 200K-token prompt can 10x the daily bill in minutes. Caught only by per-tenant cost alerts.
  3. Safety regression — the model starts saying something it shouldn't. Caught by output classifiers running on a sample of completions.
  4. Provider degradation — the model didn't change, but the provider's serving infrastructure got slower or returned more errors. Caught by latency and error-rate SLOs per provider.
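
The per-tenant cost alert from item 2 can be sketched as a comparison against a baseline. The function name, the 10x default, and the `min_spend` floor are assumptions chosen for illustration; how the baseline is computed (for example, a trailing average) is left open.

```python
def cost_alerts(spend_today, baseline_daily, multiplier=10.0, min_spend=1.0):
    """Return the tenants whose spend today is at least `multiplier` times
    their usual daily spend.

    Both arguments map tenant name -> dollars. Tenants with no baseline are
    treated as baseline 0, so they alert as soon as they pass `min_spend`.
    The floor keeps tiny absolute amounts from paging anyone.
    """
    alerts = []
    for tenant, spent in spend_today.items():
        baseline = baseline_daily.get(tenant, 0.0)
        if spent >= min_spend and spent >= multiplier * baseline:
            alerts.append(tenant)
    return sorted(alerts)
```

For example, with `spend_today = {"acme": 120.0, "globex": 9.0}` and `baseline_daily = {"acme": 10.0, "globex": 8.0}`, only `acme` is flagged.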

A startup that watches only the traditional metrics (uptime, latency, error rate) will be surprised by the AI-specific ones, often loudly.
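
A synthetic canary prompt covers both the "still returns 200s but is wrong" case and the provider-degradation case. A minimal sketch, assuming `call_model` is any callable that takes a prompt string and returns the reply string (a hypothetical stand-in for whatever provider client is in use):

```python
import time


def run_canary(call_model, prompt, must_contain, max_latency_s):
    """Send one known-answer prompt and report pass/fail with a reason."""
    start = time.monotonic()
    try:
        reply = call_model(prompt)
    except Exception as exc:  # provider errors count as a failed canary
        return {"ok": False, "reason": f"error: {exc}"}
    latency = time.monotonic() - start
    if latency > max_latency_s:
        return {"ok": False, "reason": f"slow: {latency:.2f}s"}
    if must_contain.lower() not in reply.lower():
        return {"ok": False, "reason": "unexpected answer"}
    return {"ok": True, "reason": "pass"}
```

A substring check is the crudest possible quality signal; it is here only to show where a real eval would plug in.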

Worked example: same hallucination incident, three orgs

The customer-support assistant starts confidently quoting a refund policy that doesn't exist. A customer screenshots it.

  • Solo: notice the angry tweet at 9am, flip AI_REPLIES_ENABLED=false in Vercel, redeploy (60 seconds). Spend the afternoon adjusting the prompt and re-enabling. Total downtime of AI feature: ~2 hours. Documentation: a tweet apologizing.
  • Startup: PagerDuty pages on-call at 2am from the synthetic eval going red. On-call flips the kill switch in Statsig (30 seconds), opens a Slack incident channel, drafts a status page entry. Morning standup triages the root cause; eval gets a new case; prompt gets fixed; flag re-enabled by noon. Total downtime: ~10 hours. Documentation: a one-page post-mortem.
  • Enterprise: SIEM correlates the bad completion with similar earlier completions across tenants and auto-routes to the incident commander. The kill switch auto-flips for the affected tenant cohort, the comms team drafts a customer email, and legal reviews whether the misstatement is materially actionable. Fixing the root cause (retrieval over a stale knowledge base) takes a 2-week project. Action items: stale-KB detector, new eval suite, contract amendment for the affected enterprise customer. Total downtime for cohort: ~30 minutes. Documentation: 15-page incident report, 8 action items, board-level summary.

Same bug. Three different blast radii. Each response is appropriately sized.
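
The solo response in the example above comes down to one environment variable check. A sketch, assuming the flag is read at request time and that only an explicit "false", "0", or "off" disables the feature; the canned reply text and function names are made up for illustration:

```python
import os

CANNED_REPLY = "Our assistant is temporarily unavailable. A human will follow up."

# Assumption: only these explicit values turn the feature off. An unset or
# unrecognized value leaves it on, so a typo cannot take AI replies down.
OFF_VALUES = {"false", "0", "off"}


def ai_replies_enabled(env=None):
    """Read AI_REPLIES_ENABLED from the given mapping (default: os.environ)."""
    env = os.environ if env is None else env
    return env.get("AI_REPLIES_ENABLED", "true").strip().lower() not in OFF_VALUES


def reply(question, generate, env=None):
    """Return the model's answer, or the canned reply when the switch is off."""
    if not ai_replies_enabled(env):
        return CANNED_REPLY
    return generate(question)
```

Whether an unset variable should mean "on" or "off" is a judgment call; failing closed is the safer default if a wrong answer costs more than a missing one.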

What stays the same / what changes

Stays the same: every column has some dashboard, some alert, some kill switch. The list of failure modes (drift, cost, safety, provider degradation) is identical.

Changes: the response time, the automation level, the number of artifacts produced per incident, and the blast radius the on-call is held accountable for.

Adoption order for ops investment

A useful sequence when you cross into the next column — adopt in this order, not all at once:

  1. A kill switch you've actually flipped in production. Universal across columns. If you've never flipped it, it doesn't work.
  2. A cost cap and a cost alert. Solo can use the provider dashboard's built-in cap; startups need per-tenant cost dashboards before the second paying customer.
  3. A nightly eval against a stored snapshot. Catches drift. Cheap to run, expensive to skip.
  4. A documented runbook for each known failure mode. Even a Notion page beats nothing.
  5. A real on-call rotation with a paging tool. Slack notifications everyone mutes after 3 false positives don't count.
  6. Per-feature kill switches (vs. one global switch). Required once you have more than one AI feature.
  7. Continuous eval-on-live-traffic. Enterprise-tier; usually overkill before then.
  8. Audit-grade prompt+completion logging. Required when a regulator or customer-due-diligence questionnaire forces it.
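
Step 3, the nightly eval against a stored snapshot, can be sketched as a comparison of per-case scores. The function name, the 0-to-1 score scale, and the `max_drop` default are assumptions; how each case is scored is out of scope here.

```python
def drift_report(baseline_scores, tonight_scores, max_drop=0.05):
    """Compare tonight's eval scores with the stored snapshot.

    Both arguments map eval-case id -> score in [0, 1]. Only cases present
    in both runs are compared, so adding a new case does not trip the alert.
    """
    shared = sorted(set(baseline_scores) & set(tonight_scores))
    if not shared:
        raise ValueError("no eval cases in common with the snapshot")
    baseline_mean = sum(baseline_scores[c] for c in shared) / len(shared)
    tonight_mean = sum(tonight_scores[c] for c in shared) / len(shared)
    regressed = [
        c for c in shared if baseline_scores[c] - tonight_scores[c] > max_drop
    ]
    return {
        "baseline_mean": baseline_mean,
        "tonight_mean": tonight_mean,
        "alert": baseline_mean - tonight_mean > max_drop,
        "regressed_cases": regressed,
    }
```

Reporting the individual regressed cases alongside the mean matters because a single badly broken case can hide inside an average that barely moves.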

Most startups get steps 1–3 right and skip 4–5, then are surprised when their first real incident has no runbook and the on-call doesn't know what to do.

Common mistakes

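  • Treating LLM ops like web-app ops. Uptime green + latency green does not mean "AI is working." Quality drift hides behind 200s. You need evals that track production quality (a nightly drift eval at minimum, continuous eval-on-live-traffic at enterprise scale), or you're flying blind.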
  • One global kill switch. Solo can get away with it. Startup can almost get away with it. Enterprise must have per-feature switches or every quality bug becomes a full AI outage.
  • No fire drill on the kill switch. A kill switch that hasn't been flipped in production in 6 months probably doesn't work. Schedule a quarterly drill — flip it on purpose during business hours, confirm the fallback, flip it back.
  • Datadog LLM at 5 engineers. $50K/year for a tool whose value is in the cross-team aggregation features. Langfuse Pro + Sentry covers the same ground for $500/month until you actually have multiple teams.
  • Counting on the provider's status page. Provider status pages lag reality by 20–90 minutes. Your synthetic prompts will catch a degradation before they do.
  • A runbook nobody has read. A 40-page incident runbook last updated 18 months ago is worse than nothing — it gives false confidence and sends the on-call down dead paths. Trim ruthlessly, rehearse quarterly, or delete.
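
The quarterly drill from the list above (flip it, confirm the fallback, flip it back) can be scripted so it actually happens. A sketch, assuming `flags` is a mutable mapping of feature name to enabled state and `probe` is a hypothetical check that returns "ai" or "fallback" depending on which path served a test request:

```python
def fire_drill(flags, feature, probe):
    """Flip one kill switch off, verify the fallback, then restore it."""
    original = flags.get(feature, True)
    try:
        flags[feature] = False
        fallback_ok = probe() == "fallback"
    finally:
        # Restore even if the probe raises, so a drill cannot cause an outage.
        flags[feature] = original
    restored_ok = probe() == ("ai" if original else "fallback")
    return {"fallback_ok": fallback_ok, "restored_ok": restored_ok}
```

In a real setup the mapping would be replaced by calls to the flag service, and the result would be recorded as evidence that the switch was tested.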

→ Next: Workflow comparison.