Environment Setup
In one line: Monorepo with a
prompts/package, anevals/package that runs in CI, gateway config checked into git, and secrets via Doppler or 1Password.
The single biggest predictor of an AI startup's iteration speed is whether prompt changes are first-class code. If your prompts live in random .md files, untracked Notion docs, or hardcoded strings, you'll regress silently and review will be painful. Treat prompts the same way you treat source code — versioned, reviewed, evaluated on every change.
Repository structure
my-ai-startup/
├── apps/
│ ├── web/ # Next.js app
│ └── worker/ # Python Modal/Render workers (optional)
├── packages/
│ ├── prompts/ # Versioned prompt library (TS or YAML)
│ │ ├── extract-clause.ts
│ │ ├── summarize-doc.ts
│ │ └── system-prompts.ts
│ ├── evals/ # Eval datasets + scripts
│ │ ├── datasets/
│ │ │ ├── extract-clause.jsonl
│ │ │ └── summarize-doc.jsonl
│ │ ├── runners/
│ │ └── ci.ts # Entry point for CI
│ ├── llm/ # Gateway client + retry/fallback
│ ├── db/ # Drizzle schema + pgvector helpers
│ └── ui/ # Shared shadcn components + AI primitives
├── gateway/
│ └── portkey-config.yaml # Gateway routing + fallback config
├── .env.example # Sacred — every env var the app reads
└── turbo.json
Reading this layout:
packages/prompts/is where every prompt lives, in TypeScript or YAML, versioned in git.packages/evals/holds the eval datasets and the script CI runs on every PR.packages/llm/wraps the gateway client with retry/fallback logic.gateway/portkey-config.yamlis checked-in so gateway changes go through code review like any other config.
The prompt library pattern
Every prompt is a typed function, not a magic string:
// packages/prompts/extract-clause.ts
import { z } from "zod";
export const ExtractClauseInput = z.object({
contractText: z.string(),
clauseType: z.enum(["indemnification", "termination", "ip"]),
});
export const ExtractClauseOutput = z.object({
found: z.boolean(),
text: z.string().optional(),
confidence: z.number().min(0).max(1),
});
export const extractClausePrompt = (input: z.infer<typeof ExtractClauseInput>) => ({
system: `You are a legal-clause extractor. Return strict JSON matching the schema.`,
user: `Extract the ${input.clauseType} clause from:\n\n${input.contractText}`,
// Pinned model version, structured output, eval set linked
model: "claude-sonnet-4.5-20251022",
schema: ExtractClauseOutput,
evalSuite: "extract-clause-v3",
});
Now every caller is type-safe, the prompt is reviewable, model version is pinned (no silent upgrades), and the eval suite name is linked from the prompt itself.
Eval scripts in CI
Every PR that changes a file under packages/prompts/ triggers the relevant eval suite. Example GitHub Actions step:
- name: Run affected eval suites
run: bun run evals:affected --base=${{ github.event.pull_request.base.sha }}
env:
BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
PORTKEY_API_KEY: ${{ secrets.PORTKEY_API_KEY }}
Rules:
- Eval runtime budget: 8 minutes for the affected suite. Faster than that and engineers will actually wait for it; slower and they'll skip.
- A regression on any case blocks merge.
- The eval result is posted as a PR comment with a diff vs
main. - The full suite (all features) runs nightly against
main.
Gateway configuration as code
gateway/portkey-config.yaml lives in the repo:
strategy:
mode: fallback
targets:
- provider: anthropic
api_key: $ANTHROPIC_API_KEY
weight: 1.0
- provider: openai
api_key: $OPENAI_API_KEY
weight: 1.0
retry:
attempts: 2
on_status: [429, 500, 502, 503, 504]
cache:
mode: semantic
ttl: 3600
PRs to this file go through review like any other change. No more "someone changed gateway routing in the dashboard and nobody knows when or why."
Three environments
| Environment | Use | Models | Eval gating |
|---|---|---|---|
| Local | Per-dev machine | Real provider keys, low-spend personal accounts | Pre-commit subset (~30s) |
| Preview | Per-PR Vercel deploy | Real providers, sandbox tenant data | Full affected suite (8m) |
| Production | The real thing | Pinned model versions | Nightly full suite |
Secrets
- Doppler or 1Password Developer for syncing across the team.
.env.exampleis the contract: every variable the app reads appears with a placeholder.- Critical AI keys (
ANTHROPIC_API_KEY,OPENAI_API_KEY,PORTKEY_API_KEY,BRAINTRUST_API_KEY) live in Doppler, synced to Vercel and Modal. - Per-engineer personal LLM keys in dev — separate from prod — so a dev infinite-loop doesn't bill the company $4K.
The onboarding script
#!/usr/bin/env bash
set -e
echo "Installing Bun..."
curl -fsSL https://bun.sh/install | bash
echo "Installing deps..."
bun install
echo "Pulling secrets from Doppler..."
doppler setup
doppler secrets download --no-file --format env > .env.local
echo "Spinning up local Postgres + pgvector..."
docker-compose up -d
bun run db:migrate && bun run db:seed
echo "Running a smoke eval against your dev keys..."
bun run evals:smoke
echo "Starting dev server..."
bun run dev
End of day one bar: app running locally, dev keys configured, smoke eval passed, one merged PR. If a new hire can't hit this, the onboarding script is broken.
A 15-person AI startup had prompts as inline template strings inside route handlers. An engineer "cleaned up" a system prompt during refactoring — no eval ran because the file looked like a route file, not a prompt file. Three days later, support tickets spike: the assistant now refuses certain valid queries because the cleaned-up prompt accidentally narrowed its scope.
Root cause: prompts buried inside non-prompt files were invisible to the eval-gating logic. Fix: every prompt moved to packages/prompts/. CI rule: changes there trigger evals. Total fix time: one afternoon. Total avoided: weeks of recurring stealth regressions.
The number-one cause of "we spent $8,000 in an afternoon" is an engineer's local script using the production API key in a runaway loop. Doppler can serve different secrets to dev and prod environments. Set this up before the first new engineer joins.
Common mistakes
- Storing prompts in a Notion page. Notion changes aren't reviewed. Notion changes don't trigger evals. By month three the Notion version doesn't match the code. Always: prompts in git.
- Eval suite that takes 25 minutes. Nobody waits for it. They merge anyway, eval fails on
main, the alert is ignored. Aim for under 8 minutes; subset aggressively for the affected suite. - Sharing one staging vector index across all preview environments. Two engineers run different embedding jobs into the same index and start seeing each other's data. Use per-PR or per-engineer namespaces.
- Letting the gateway config drift in the dashboard. Three people edit Portkey UI; nobody documents it. Move config to code immediately.
- No smoke eval in onboarding. Day one of every new hire should include a successful eval run against their own dev key. If it doesn't work, that engineer's onboarding gets stuck on real production incidents two weeks later.
What's next
→ Continue to Development Loop where we cover the daily prompt iteration rhythm, PR review for AI changes, and preview deploys.