CI/CD for AI Startups

In one line: Lint, type-check, unit + integration tests, eval suite, preview deploy, then cohort rollout. Every step blocks the next. Eval and adversarial suites are gates, not warnings.

In plain English

A regular CI pipeline answers "does the code work?" An AI CI pipeline also has to answer "does the output still meet the bar?" That extra question is what justifies eval-gating in CI. The first time you ship a prompt change without an eval and silently regress your top customer, you'll wish you'd set this up on day one.

The pipeline shape

Lint → type-check → unit + integration tests → eval suite → preview deploy → cohort rollout. Each step blocks the next. CI minutes are cheaper than customer churn.

A typical workflow file

# .github/workflows/ci.yaml
name: CI
on: [pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
      - run: bun run lint
      - run: bun run typecheck
      - run: bun run test           # unit + integration

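  changes:
    runs-on: ubuntu-latest
    outputs:
      prompts: ${{ steps.filter.outputs.prompts }}
    steps:
      # pull_request.changed_files is a file count, not a list of paths,
      # so a path-filter action decides whether prompts were touched.
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            prompts:
              - 'packages/prompts/**'

  evals:
    needs: [validate, changes]
    runs-on: ubuntu-latest
    if: needs.changes.outputs.prompts == 'true'
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # assumes evals:affected diffs against the base SHA
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
      - run: bun run evals:affected --base=${{ github.event.pull_request.base.sha }}
        env:
          PORTKEY_API_KEY: ${{ secrets.PORTKEY_API_KEY }}
          BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
      - name: Post eval delta to PR
        if: always()       # post the delta even when the eval step fails
        run: bun run evals:post-comment

  adversarial:
    needs: evals           # skipped automatically whenever evals is skipped
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v1
      - run: bun install --frozen-lockfile
      - run: bun run evals:adversarial
        env:
          PORTKEY_API_KEY: ${{ secrets.PORTKEY_API_KEY }}   # assumed: same gateway key as evals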

Reading it: validate runs the cheap stuff (lint, types, unit, integration). evals only runs if the PR touched packages/prompts/. adversarial runs after evals pass. Anything failing blocks merge via branch protection.
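
How `evals:affected` picks its targets isn't shown in the workflow. One plausible approach is to map changed file paths to prompt folders. The sketch below assumes each prompt lives in its own directory under packages/prompts/<name>/; that layout and the function name `affectedPrompts` are assumptions for illustration.

```typescript
// Hypothetical helper: maps changed file paths to the prompts whose evals should run.
// Assumes one directory per prompt: packages/prompts/<promptName>/...
const PROMPTS_ROOT = "packages/prompts/";

function affectedPrompts(changedFiles: string[]): string[] {
  const names = new Set<string>();
  for (const file of changedFiles) {
    if (!file.startsWith(PROMPTS_ROOT)) continue; // not a prompt change
    const rest = file.slice(PROMPTS_ROOT.length);
    const slash = rest.indexOf("/");
    // Files sitting directly under packages/prompts/ belong to no single prompt.
    if (slash > 0) names.add(rest.slice(0, slash));
  }
  return Array.from(names).sort();
}
```

Returning a sorted, de-duplicated list keeps the eval run deterministic when a PR touches several files in the same prompt.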

Branch protection

Required on main:

All CI jobs must pass.
At least one reviewer for any PR touching packages/prompts/.
At least two reviewers for any PR touching Tier-0 or Tier-1 prompts.
Linear history (squash or rebase merges only).
Signed commits if SOC 2 prep is on the roadmap.
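
The reviewer rules above can be written down as a small function. GitHub's built-in branch protection sets a single approval count per branch, so a path-dependent count like this would most likely be enforced by a custom check or bot; that enforcement mechanism, the `Tier` type, and `requiredReviewers` are all assumptions in this sketch.

```typescript
// Hypothetical tier labels: the text names Tier-0 and Tier-1; lower tiers are assumed.
type Tier = 0 | 1 | 2 | 3;

interface ChangedPrompt {
  name: string;
  tier: Tier;
}

// Returns how many approving reviews the prompt rules demand for a PR.
function requiredReviewers(changedPrompts: ChangedPrompt[]): number {
  // No prompt files touched: the prompt-specific rules add no requirement.
  if (changedPrompts.length === 0) return 0;
  const touchesCritical = changedPrompts.some((p) => p.tier <= 1);
  return touchesCritical ? 2 : 1;
}
```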

The cost-aware CI rule

Eval suites cost real money — every case is an LLM call. Two rules to keep CI bills sane:

Affected-only on PRs. Only run evals for prompts that changed. Full suite runs nightly on main.
Spend cap on the CI gateway key. Set a daily budget (e.g., $50). If a runaway PR loop exhausts it, the next eval run fails fast with a "budget exceeded" error rather than ballooning to $4,000.
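
In practice the gateway enforces the cap, but the fail-fast behavior is easy to picture as a pre-flight check in the eval runner. A minimal sketch, assuming the runner can read today's spend and estimate the cost of the run; `BudgetExceededError` and `assertWithinBudget` are hypothetical names, not part of any gateway's API.

```typescript
// Hypothetical error type; a real gateway returns its own budget error.
class BudgetExceededError extends Error {
  constructor(spentUsd: number, capUsd: number) {
    super(`budget exceeded: $${spentUsd.toFixed(2)} spent of $${capUsd.toFixed(2)} daily cap`);
    this.name = "BudgetExceededError";
  }
}

// Throws before any LLM call is made if the run would push spend past the cap.
function assertWithinBudget(
  spentTodayUsd: number,
  estimatedRunUsd: number,
  dailyCapUsd: number
): void {
  if (spentTodayUsd + estimatedRunUsd > dailyCapUsd) {
    throw new BudgetExceededError(spentTodayUsd, dailyCapUsd);
  }
}
```

Checking before the first call means a runaway loop costs at most one day's cap instead of accumulating unnoticed over a weekend.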

Typical eval cost for a healthy startup: $100–$500/month, split across CI runs and nightly full sweeps.
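
To see how a monthly figure in that range can come about, here is a back-of-the-envelope estimator. Every input value below is an illustrative assumption, not a measurement; plug in your own run counts and per-case cost.

```typescript
interface EvalUsage {
  prRunsPerMonth: number;      // CI runs that touched prompts
  casesPerPrRun: number;       // affected-only subset
  nightlyRunsPerMonth: number; // full sweeps on main
  casesPerNightlyRun: number;  // full suite
  costPerCaseUsd: number;      // average LLM cost per eval case
}

function monthlyEvalCostUsd(u: EvalUsage): number {
  const ci = u.prRunsPerMonth * u.casesPerPrRun * u.costPerCaseUsd;
  const nightly = u.nightlyRunsPerMonth * u.casesPerNightlyRun * u.costPerCaseUsd;
  return ci + nightly;
}

// Illustrative inputs only: 40 PR runs of 30 cases, 30 nightly runs of 300 cases, $0.02 per case.
const exampleMonthlyCost = monthlyEvalCostUsd({
  prRunsPerMonth: 40,
  casesPerPrRun: 30,
  nightlyRunsPerMonth: 30,
  casesPerNightlyRun: 300,
  costPerCaseUsd: 0.02,
}); // about $24 from CI plus about $180 from nightly sweeps
```

With these made-up inputs the nightly sweeps dominate, which is one reason to keep PR runs affected-only.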

Preview deploys

Every PR gets a Vercel preview URL automatically. The preview:

Hits real provider APIs (no mocks).
Uses a sandbox tenant in a separate Supabase project.
Has its own gateway key with low daily spend cap.
Posts the preview URL as a PR comment.

Designer, PM, and AI engineer all click the preview before approving merge.

The merge → deploy flow

Once a PR merges to main:

Auto-deploy to canary (5% of traffic). Vercel + feature-flag cohort handles this.
24-hour soak. Dashboards watch eval score on prod traces, p95 latency, cost/answer, error rate.
Auto-promote to 25% if no alerts. Notify in Slack.
Manual promote to 50% → 100% by the on-call engineer.

Any threshold breach during canary or soak → auto-rollback by flipping the cohort flag.
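
The promotion logic above amounts to a small decision function that a rollout controller could evaluate on a schedule. This is a sketch under assumptions: the type names, the `thresholdBreached` flag (standing in for whatever alerting you use), and the choice to roll back at every stage are illustrative; the text only specifies rollback during canary and soak.

```typescript
type RolloutAction =
  | { kind: "rollback" }
  | { kind: "wait" }
  | { kind: "auto-promote"; toPercent: 25 }
  | { kind: "await-manual-promote" };

interface CohortState {
  percent: number;            // current share of traffic on the new version
  hoursAtThisStage: number;   // time since this stage started
  thresholdBreached: boolean; // any alert: eval score, p95 latency, cost/answer, error rate
}

const SOAK_HOURS = 24;

function nextRolloutAction(s: CohortState): RolloutAction {
  // Assumption: a breach rolls back at any stage, not only during canary.
  if (s.thresholdBreached) return { kind: "rollback" };
  if (s.percent <= 5) {
    return s.hoursAtThisStage >= SOAK_HOURS
      ? { kind: "auto-promote", toPercent: 25 }
      : { kind: "wait" };
  }
  // Beyond 25%, the on-call engineer promotes by hand.
  return { kind: "await-manual-promote" };
}
```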

Deploy windows

Production deploys: Monday–Thursday, 9am–4pm local. Avoid Friday afternoon and weekends.
Tier-0 prompt changes: Tuesday or Wednesday only. That leaves the most weekday engineering hours to respond if something goes wrong.
Emergency hotfix: allowed any time, but requires a paged on-call engineer ready to monitor.
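
These windows are simple enough to enforce in a deploy script rather than by memory. A minimal sketch, assuming the check runs in the team's local time zone and that Tier-0 changes keep the same 9am to 4pm hours (the text only restricts their days); `DeployKind` and `isDeployAllowed` are hypothetical names.

```typescript
type DeployKind = "standard" | "tier0" | "hotfix";

// `now` is read in the runtime's local time zone.
function isDeployAllowed(kind: DeployKind, now: Date, onCallPaged = false): boolean {
  // Hotfixes may go out any time, but only with a paged on-call engineer watching.
  if (kind === "hotfix") return onCallPaged;
  const day = now.getDay();   // 0 = Sunday ... 6 = Saturday
  const hour = now.getHours();
  const inHours = hour >= 9 && hour < 16; // 9am up to, but not including, 4pm
  // Tier-0: Tuesday or Wednesday. Everything else: Monday through Thursday.
  const allowedDays = kind === "tier0" ? [2, 3] : [1, 2, 3, 4];
  return inHours && allowedDays.indexOf(day) !== -1;
}
```

Usage: call `isDeployAllowed("tier0", new Date())` at the start of the deploy job and exit early when it returns false.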

Hotfix path

When something breaks in prod and needs a fix now:

Branch from main with prefix hotfix/.
CI runs the full required suite — no shortcuts.
Reviewer approves; merge.
Deploy goes straight to 100% rather than canary (because the alternative is staying broken).
Post-mortem within 48 hours.

If "we need to skip evals to ship this hotfix" comes up: the answer is no. If evals are the bottleneck on a hotfix, the eval suite is too slow — fix that separately.
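
Encoding the hotfix path in the deploy tooling makes the "no shortcuts" rule hard to bypass: the only things a hotfix branch changes are the rollout target and the post-mortem deadline. The `DeployPlan` shape and `planDeploy` below are hypothetical names for illustration.

```typescript
interface DeployPlan {
  runFullSuite: true;                // never optional, including for hotfixes
  initialPercent: 5 | 100;           // canary vs. straight to everyone
  postMortemDueHours: number | null; // only hotfixes require a post-mortem
}

function planDeploy(branch: string): DeployPlan {
  const isHotfix = branch.startsWith("hotfix/");
  return {
    runFullSuite: true,
    initialPercent: isHotfix ? 100 : 5,
    postMortemDueHours: isHotfix ? 48 : null,
  };
}
```

Typing `runFullSuite` as the literal `true` means there is no code path that can turn the suite off.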

Worked example: a canary rollout that caught a stealth regression

A 25-person AI startup merges a prompt cleanup that passed the eval suite (no regression detected). Auto-deploys to 5% canary. During the 24h soak, the prod dashboard shows the LLM-as-judge score on real traffic drops 8 points.

Auto-rollback fires; the on-call engineer investigates. Finding: the eval set had 30 cases but didn't cover the specific input pattern that 12% of real users hit. The "clean" prompt narrowed handling for that pattern.

Fix: add 15 new cases to the eval set covering the missed pattern; reattempt the prompt change with the broader bar; ship clean. The canary system caught what the eval set didn't — that's why both layers exist.
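
The gap in this example could have been flagged earlier by comparing what production traffic looks like with what the eval set covers. A rough sketch, assuming requests and eval cases are already tagged with an input-pattern label (how that tagging happens is out of scope here, and `uncoveredPatterns` is a hypothetical helper):

```typescript
// Returns input patterns that carry at least `minShare` of production traffic
// but have no case in the eval set.
function uncoveredPatterns(
  prodCounts: Record<string, number>, // pattern label -> number of prod requests
  evalPatterns: string[],             // pattern label of each eval case
  minShare: number                    // e.g. 0.05 flags anything at or above 5% of traffic
): string[] {
  const entries = Object.entries(prodCounts);
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  if (total === 0) return [];
  const covered = new Set(evalPatterns);
  return entries
    .filter(([label, count]) => count / total >= minShare && !covered.has(label))
    .map(([label]) => label)
    .sort();
}
```

Running a check like this against a sample of recent traces on a schedule turns "the eval set missed a pattern" from a post-incident finding into a routine report.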

Highlight: eval gating in CI is the single highest-leverage rule

One rule — "eval suite must pass before merge" — replaces dozens of process workarounds. No need for "AI review board," "weekly prompt review meeting," or "senior approval for any model change." The eval suite is the review. Process collapses into a number.

The teams that resist this rule (usually because their suite is slow or flaky) end up with much heavier human process to compensate. Fix the suite; the process disappears.
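
"Process collapses into a number" can be taken literally: the gate is a comparison between baseline and candidate scores. A minimal sketch, assuming scores on a 0 to 100 scale and a tolerated drop chosen by the team; the type names, `evalGate`, and the threshold are illustrative assumptions, not a specific eval tool's API.

```typescript
interface EvalResult {
  prompt: string;
  baselineScore: number;  // score on the base branch, 0-100
  candidateScore: number; // score on the PR branch, 0-100
}

interface GateVerdict {
  pass: boolean;
  regressions: string[]; // prompts whose score dropped more than allowed
}

function evalGate(results: EvalResult[], maxDrop: number): GateVerdict {
  const regressions = results
    .filter((r) => r.baselineScore - r.candidateScore > maxDrop)
    .map((r) => r.prompt);
  return { pass: regressions.length === 0, regressions };
}
```

A CI script would exit non-zero when `pass` is false, which is what lets branch protection treat the eval suite as a required check.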

Common mistakes

Where people commonly trip up

CI that takes 30 minutes. Engineers will skip it or context-switch and lose flow. Aim for under 12 minutes total. Cache aggressively (bun lockfile, Turbo cache, Docker layers).
Allowing --no-verify for prompt PRs. It's a culture-killer. Once one engineer does it, the discipline collapses. The flag skips local git hooks entirely, so a pre-commit hook can't stop it; rerun the same checks in CI and make them required in branch protection.
No spend cap on the CI gateway key. A runaway eval loop in a PR bills $1,500 over a weekend. Always cap.
No auto-rollback on the canary cohort. If the on-call engineer has to be paged and manually intervene every time, you'll suffer prolonged outages at night. Auto-rollback for threshold breaches is mandatory.
Deploying Tier-0 changes on Friday at 4pm. Just don't.

What's next

→ Continue to Deployment where we cover feature flags, cohort rollouts, kill switches, and the AI-as-change-management reality.
