CI/CD for AI Startups
In one line: Lint, type-check, unit + integration tests, eval suite, preview deploy, then cohort rollout. Every step blocks the next. Eval and adversarial suites are gates, not warnings.
A regular CI pipeline answers "does the code work?" An AI CI pipeline also has to answer "does the output still meet the bar?" That extra question is what justifies eval-gating in CI. The first time you ship a prompt change without an eval and silently regress your top customer, you'll wish you'd set this up on day one.
The pipeline shape
Each step blocks the next. CI minutes are cheaper than customer churn.
A typical workflow file
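# .github/workflows/ci.yaml
name: CI
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
      - run: bun run lint
      - run: bun run typecheck
      - run: bun run test # unit + integration
  # Detects whether the PR touched packages/prompts/. The pull_request payload's
  # changed_files field is only a file count, so contains() cannot path-match on it.
  changes:
    runs-on: ubuntu-latest
    outputs:
      prompts: ${{ steps.filter.outputs.prompts }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            prompts:
              - 'packages/prompts/**'
  evals:
    needs: [validate, changes]
    runs-on: ubuntu-latest
    if: needs.changes.outputs.prompts == 'true'
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history, assuming evals:affected diffs against the base SHA
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
      - run: bun run evals:affected --base=${{ github.event.pull_request.base.sha }}
        env:
          PORTKEY_API_KEY: ${{ secrets.PORTKEY_API_KEY }}
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
      - name: Post eval delta to PR
        run: bun run evals:post-comment
  adversarial:
    needs: [changes, evals]
    runs-on: ubuntu-latest
    if: needs.changes.outputs.prompts == 'true'
    steps:
      # Each job starts on a fresh runner, so it needs its own checkout and install.
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
      - run: bun run evals:adversarial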
Reading it:
- validate runs the cheap stuff (lint, types, unit, integration).
- evals only runs if the PR touched packages/prompts/.
- adversarial runs after evals pass.
- Anything failing blocks merge via branch protection.
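To make "gate, not warning" concrete, here is a minimal sketch of the comparison a script like evals:affected might run before deciding its exit code. The type names, the 0 to 100 score scale, and the 2-point tolerance are all assumptions for illustration; none of them come from a specific eval library.

```typescript
// Hypothetical shapes: prompt id -> mean eval score on a 0-100 scale.
type EvalScores = Record<string, number>;

interface Regression {
  promptId: string;
  baseline: number;
  candidate: number;
  delta: number;
}

interface GateResult {
  passed: boolean;
  regressions: Regression[];
}

// Fails the gate if any prompt scores more than `maxDrop` points below its
// baseline (for example, scores recorded on the PR's base commit).
function evalGate(baseline: EvalScores, candidate: EvalScores, maxDrop = 2): GateResult {
  const regressions: Regression[] = [];
  for (const [promptId, candidateScore] of Object.entries(candidate)) {
    const baselineScore = baseline[promptId];
    if (baselineScore === undefined) continue; // new prompt: nothing to compare against
    const delta = candidateScore - baselineScore;
    if (delta < -maxDrop) {
      regressions.push({ promptId, baseline: baselineScore, candidate: candidateScore, delta });
    }
  }
  return { passed: regressions.length === 0, regressions };
}

// Example with made-up prompt ids and scores.
const result = evalGate(
  { "support-triage": 91, summarize: 88 },
  { "support-triage": 90, summarize: 80 },
);
console.log(
  result.passed
    ? "eval gate passed"
    : `eval gate failed: ${JSON.stringify(result.regressions)}`,
);
```

In CI, the script would exit non-zero when passed is false, which fails the job and blocks the merge. The same regressions array could feed the PR comment.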
Branch protection
Required on main:
- All CI jobs must pass.
- 1 reviewer for any PR touching packages/prompts/.
- 2 reviewers for any PR touching Tier-0 or Tier-1 prompts.
- Linear history (squash or rebase merges only).
- Signed commits if SOC 2 prep is on the roadmap.
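Branch protection settings generally apply one approval count to the whole branch, so a count that depends on path and tier likely needs a small custom check. The sketch below shows the rule itself. The tier registry, the one-directory-per-prompt layout, and the function names are hypothetical.

```typescript
// Hypothetical registry: prompt directory name -> tier (lower is more critical).
const PROMPT_TIERS: Record<string, number> = {
  "checkout-assistant": 0,
  "support-triage": 1,
  "release-notes": 3,
};

const PROMPTS_DIR = "packages/prompts/";

// Assumes one directory per prompt: packages/prompts/<name>/...
function promptName(file: string): string | undefined {
  const match = /^packages\/prompts\/([^/]+)\//.exec(file);
  return match ? match[1] : undefined;
}

// Returns the number of approvals the reviewer rules require for a PR.
// A result of 0 means these rules add no requirement of their own.
function requiredReviewers(
  changedFiles: string[],
  tiers: Record<string, number> = PROMPT_TIERS,
): number {
  const touchesPrompts = changedFiles.some((f) => f.startsWith(PROMPTS_DIR));
  if (!touchesPrompts) return 0;

  const touchesCritical = changedFiles.some((f) => {
    const name = promptName(f);
    if (name === undefined) return false;
    const tier = tiers[name];
    // Prompts missing from the registry fall back to the 1-reviewer rule
    // (an assumption; a stricter team might treat unknown as critical).
    return tier !== undefined && tier <= 1;
  });

  return touchesCritical ? 2 : 1;
}

console.log(requiredReviewers(["packages/prompts/support-triage/prompt.md"]));
```

A CI step could compare this number against the PR's current approval count and fail the check until enough reviews are in.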
The cost-aware CI rule
Eval suites cost real money — every case is an LLM call. Two rules to keep CI bills sane:
- Affected-only on PRs. Only run evals for prompts that changed. Full suite runs nightly on main.
- Spend cap on the CI gateway key. Set a daily budget (e.g., $50). If a runaway PR loop blows it, the next eval run fails fast with a "budget exceeded" error rather than ballooning to $4,000.
Typical eval cost for a healthy startup: $100–$500/month, split across CI runs and nightly full sweeps.
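Both rules are simple enough to sketch. The code below assumes a hypothetical layout of one directory per prompt under packages/prompts/, with an eval suite of the same name, and a client-side guard that mirrors the gateway's daily cap. The real cap should still be enforced on the gateway key; the guard only makes the failure faster and more readable.

```typescript
// Rule 1: map changed files to the eval suites that need to run.
// Assumes packages/prompts/<name>/... with a suite named <name>.
function affectedSuites(changedFiles: string[]): string[] {
  const suites = new Set<string>();
  for (const file of changedFiles) {
    const match = /^packages\/prompts\/([^/]+)\//.exec(file);
    const name = match ? match[1] : undefined;
    if (name) suites.add(name);
  }
  return Array.from(suites).sort();
}

// Rule 2: fail fast once the estimated spend would pass the daily cap.
class SpendGuard {
  private spentUsd = 0;
  private readonly dailyCapUsd: number;

  constructor(dailyCapUsd: number) {
    this.dailyCapUsd = dailyCapUsd;
  }

  // Call before each eval case with its estimated cost in USD.
  charge(estimatedUsd: number): void {
    if (this.spentUsd + estimatedUsd > this.dailyCapUsd) {
      throw new Error(
        `budget exceeded: ${this.spentUsd} + ${estimatedUsd} USD is over the ${this.dailyCapUsd} USD cap`,
      );
    }
    this.spentUsd += estimatedUsd;
  }

  get remainingUsd(): number {
    return this.dailyCapUsd - this.spentUsd;
  }
}

const suitesToRun = affectedSuites([
  "packages/prompts/summarize/prompt.md",
  "packages/prompts/summarize/cases.json",
  "apps/web/page.tsx",
]);
console.log(suitesToRun);
```

The nightly run on main would skip affectedSuites and run every suite, still behind the same guard.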
Preview deploys
Every PR gets a Vercel preview URL automatically. The preview:
- Hits real provider APIs (no mocks).
- Uses a sandbox tenant in a separate Supabase project.
- Has its own gateway key with low daily spend cap.
- Posts the preview URL as a PR comment.
Designer, PM, and AI engineer all click the preview before approving merge.
The merge → deploy flow
Once a PR merges to main:
- Auto-deploy to canary (5% of traffic). Vercel + feature-flag cohort handles this.
- 24-hour soak. Dashboards watch eval score on prod traces, p95 latency, cost/answer, error rate.
- Auto-promote to 25% if no alerts. Notify in Slack.
- Manual promote to 50% → 100% by the on-call engineer.
Any threshold breach during canary or soak → auto-rollback by flipping the cohort flag.
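The flow above reduces to a small decision function that a scheduled job could evaluate against the dashboards. This is a sketch under stated assumptions: the metric names, threshold values, and action shapes are illustrative, and applying the breach check at every stage (not only canary and soak) is a choice made here, not something the flow specifies.

```typescript
// All names and threshold values below are illustrative assumptions.
interface CanaryMetrics {
  evalScore: number; // judge score on prod traces
  p95LatencyMs: number;
  costPerAnswerUsd: number;
  errorRate: number; // fraction between 0 and 1
}

interface Thresholds {
  minEvalScore: number;
  maxP95LatencyMs: number;
  maxCostPerAnswerUsd: number;
  maxErrorRate: number;
}

type RolloutPercent = 5 | 25 | 50 | 100;

type Action =
  | { kind: "rollback"; breaches: string[] }
  | { kind: "hold" }
  | { kind: "auto-promote"; toPercent: 25 }
  | { kind: "await-manual-promote"; toPercent: 50 | 100 }
  | { kind: "done" };

function findBreaches(m: CanaryMetrics, t: Thresholds): string[] {
  const breaches: string[] = [];
  if (m.evalScore < t.minEvalScore) breaches.push("evalScore");
  if (m.p95LatencyMs > t.maxP95LatencyMs) breaches.push("p95LatencyMs");
  if (m.costPerAnswerUsd > t.maxCostPerAnswerUsd) breaches.push("costPerAnswerUsd");
  if (m.errorRate > t.maxErrorRate) breaches.push("errorRate");
  return breaches;
}

function nextAction(
  percent: RolloutPercent,
  hoursAtStage: number,
  metrics: CanaryMetrics,
  thresholds: Thresholds,
): Action {
  // Any breach wins over promotion: flip the cohort flag back.
  const breaches = findBreaches(metrics, thresholds);
  if (breaches.length > 0) return { kind: "rollback", breaches };

  if (percent === 5) {
    // 24-hour soak before the automatic step to 25%.
    return hoursAtStage >= 24 ? { kind: "auto-promote", toPercent: 25 } : { kind: "hold" };
  }
  // Steps beyond 25% are left to the on-call engineer.
  if (percent === 25) return { kind: "await-manual-promote", toPercent: 50 };
  if (percent === 50) return { kind: "await-manual-promote", toPercent: 100 };
  return { kind: "done" };
}

const thresholds: Thresholds = {
  minEvalScore: 85,
  maxP95LatencyMs: 4000,
  maxCostPerAnswerUsd: 0.05,
  maxErrorRate: 0.02,
};
const healthy: CanaryMetrics = {
  evalScore: 90,
  p95LatencyMs: 2500,
  costPerAnswerUsd: 0.03,
  errorRate: 0.005,
};
console.log(nextAction(5, 24, healthy, thresholds));
```

Keeping the decision pure, with metrics passed in and an action returned, makes it easy to unit-test the rollout policy without touching the flag service.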
Deploy windows
- Production deploys: Monday–Thursday, 9am–4pm local. Avoid Friday afternoon and weekends.
- Tier-0 prompt changes: Tuesday or Wednesday only. Midweek leaves the most working hours to respond if something goes wrong.
- Emergency hotfix: allowed any time, but requires a paged on-call engineer ready to monitor.
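These windows can be enforced by a check in the deploy script. The sketch below is one way to encode them. The ChangeKind names and the onCallPaged flag are hypothetical, and it assumes Tier-0 changes also follow the 9am to 4pm hours, which the list above does not state explicitly.

```typescript
type ChangeKind = "standard" | "tier0-prompt" | "hotfix";

// `localTime` is read with local-time getters, so the caller is responsible
// for running this in (or converting to) the team's time zone.
function deployAllowed(kind: ChangeKind, localTime: Date, onCallPaged = false): boolean {
  // Hotfixes may go out at any time, but only with a paged on-call engineer.
  if (kind === "hotfix") return onCallPaged;

  const day = localTime.getDay(); // 0 = Sunday ... 6 = Saturday
  const hour = localTime.getHours();
  const inHours = hour >= 9 && hour < 16; // 9am up to, not including, 4pm

  if (kind === "tier0-prompt") {
    return inHours && (day === 2 || day === 3); // Tuesday or Wednesday
  }
  return inHours && day >= 1 && day <= 4; // Monday through Thursday
}

// 2 January 2024 was a Tuesday; months are zero-based in the Date constructor.
console.log(deployAllowed("tier0-prompt", new Date(2024, 0, 2, 10, 0)));
```

A deploy script could call this first and refuse to continue, printing the next open window, when it returns false.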
Hotfix path
When something breaks in prod and needs a fix now:
- Branch from main with prefix hotfix/.
- CI runs the full required suite, no shortcuts.
- Reviewer approves; merge.
- Deploy goes straight to 100% rather than canary (because the alternative is staying broken).
- Post-mortem within 48 hours.
If "we need to skip evals to ship this hotfix" comes up: the answer is no. If evals are the bottleneck on a hotfix, the eval suite is too slow — fix that separately.
A 25-person AI startup merges a prompt cleanup that passed the eval suite (no regression detected). Auto-deploys to 5% canary. During the 24h soak, the prod dashboard shows the LLM-as-judge score on real traffic drops 8 points.
Auto-rollback fires; the on-call engineer investigates. Finding: the eval set had 30 cases but didn't cover the specific input pattern that 12% of real users hit. The "clean" prompt narrowed handling for that pattern.
Fix: add 15 new cases to the eval set covering the missed pattern; reattempt the prompt change with the broader bar; ship clean. The canary system caught what the eval set didn't — that's why both layers exist.
One rule — "eval suite must pass before merge" — replaces dozens of process workarounds. No need for "AI review board," "weekly prompt review meeting," or "senior approval for any model change." The eval suite is the review. Process collapses into a number.
The teams that resist this rule (usually because their suite is slow or flaky) end up with much heavier human process to compensate. Fix the suite; the process disappears.
Common mistakes
- CI that takes 30 minutes. Engineers will skip it or context-switch and lose flow. Aim for under 12 minutes total. Cache aggressively (bun lockfile, Turbo cache, Docker layers).
- Allowing --no-verify for prompt PRs. It's a culture-killer. Once one engineer does it, the discipline collapses. The flag skips local pre-commit hooks entirely, so a hook cannot block it; rerun the same checks in CI and require them through branch protection.
- No spend cap on the CI gateway key. A runaway eval loop in a PR bills $1,500 over a weekend. Always cap.
- No auto-rollback on the canary cohort. If the on-call engineer has to be paged and manually intervene every time, you'll suffer prolonged outages at night. Auto-rollback for threshold breaches is mandatory.
- Deploying Tier-0 changes on Friday at 4pm. Just don't.
What's next
→ Continue to Deployment where we cover feature flags, cohort rollouts, kill switches, and the AI-as-change-management reality.