The Development Loop
In one line: Open Jupyter (or a REPL), iterate the prompt against your 20-row eval CSV, commit each meaningful change. That's the entire day-to-day.
A solo AI project is not a normal "code → run → test" loop. Most of the work is prompt iteration. The right environment is a notebook or REPL where you can re-run the same prompt against the same 20 rows in two seconds, eyeball pass/fail, edit the prompt, repeat. Get this rhythm right and everything else flows from it.
→ Going deeper: The 20-row CSV here is the solo-scale version of a real eval discipline. When your project grows or you want to understand why this works — metrics, LLM-as-judge, statistical significance — see Chapter 5: Evaluation & Measurement.
The four-file v0 codebase
A solo AI project, at v0, is usually four files:
.
├── prompts/
│ └── main.md # The system prompt, in plain markdown
├── eval.csv # 20 hand-picked test rows from planning
├── eval.py # Runs the prompt against eval.csv, prints results
└── app/api/route.ts # (or main.py) — the production endpoint
That's it. No src/prompts/v3/system/, no Pydantic prompt classes, no abstract base prompts.
The eval script (Python, ~40 lines)
# eval.py
import csv, os
from anthropic import Anthropic
client = Anthropic()
MODEL = "claude-sonnet-4-5"
SYSTEM = open("prompts/main.md").read()
def run(user_input: str) -> str:
msg = client.messages.create(
model=MODEL,
max_tokens=1024,
system=SYSTEM,
messages=[{"role": "user", "content": user_input}],
)
return msg.content[0].text
def check(output: str, must_contain: str, must_not_contain: str) -> bool:
if must_contain and must_contain.lower() not in output.lower():
return False
if must_not_contain and must_not_contain.lower() in output.lower():
return False
return True
if __name__ == "__main__":
passed = failed = 0
with open("eval.csv") as f:
for row in csv.DictReader(f):
out = run(row["input"])
ok = check(out, row["expected_must_contain"], row["expected_must_not_contain"])
print(f"[{'PASS' if ok else 'FAIL'}] #{row['id']} {row['notes']}")
if not ok:
print(f" output: {out[:200]}...")
passed += int(ok); failed += int(not ok)
print(f"\n{passed} passed, {failed} failed")
Run with python eval.py. Total time per run: maybe 30 seconds for 20 rows. Cost: a couple cents. This is your entire eval system at v0. It will carry you well into real users.
The TypeScript equivalent
Same idea, ~50 lines, using the Vercel AI SDK:
// eval.ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import fs from "node:fs";
import { parse } from "csv-parse/sync";
const SYSTEM = fs.readFileSync("prompts/main.md", "utf8");
const rows = parse(fs.readFileSync("eval.csv"), { columns: true });
let passed = 0, failed = 0;
for (const r of rows) {
const { text } = await generateText({
model: anthropic("claude-sonnet-4-5"),
system: SYSTEM,
prompt: r.input,
maxTokens: 1024,
});
const must = (r.expected_must_contain ?? "").toLowerCase();
const mustNot = (r.expected_must_not_contain ?? "").toLowerCase();
const ok =
(!must || text.toLowerCase().includes(must)) &&
(!mustNot || !text.toLowerCase().includes(mustNot));
console.log(`[${ok ? "PASS" : "FAIL"}] #${r.id} ${r.notes}`);
if (!ok) console.log(` output: ${text.slice(0, 200)}...`);
ok ? passed++ : failed++;
}
console.log(`\n${passed} passed, ${failed} failed`);
pnpm tsx eval.ts. Same vibes.
The inner loop
The cadence is:
- Open the prompt in
prompts/main.md. Edit it. - Run
python eval.py(ortsx eval.ts). 30 seconds. - Look at failures. Don't fix them all — pick the one that matters most.
- Edit the prompt to address it. Re-run.
- When all 20 pass (or you accept the trade-off),
git commit -m "prompt: handle empty input". - Push. Vercel/Modal auto-deploys. Try the live URL.
Total cycle: 2–5 minutes. You should be doing 10–20 of these per session.
For the messiest prompt iteration phase, a Jupyter notebook lets you keep the prompt and a few sample inputs in a cell and re-run individual cases without paying for all 20 every time. The workflow is the same — you just trade python eval.py for "Shift+Enter, look, edit, repeat."
Keep the notebook in notebooks/scratch.ipynb, gitignore the outputs (.ipynb outputs balloon diffs), and graduate findings into eval.csv rows when they're worth keeping.
Prompts in files, not in code
Keep prompts in .md files in prompts/, not as multi-line strings in your route handler. Three reasons:
- You'll iterate them more when they're in their own file. Editor doesn't fight you with TS escaping.
- Diffs are readable. A
git diffon a prompt change is plain English. - You can ship updates without redeploying if you read the file at runtime (or check in changes per deploy).
A typical layout:
prompts/
├── main.md # the system prompt
├── style-guide.md # voice/tone, included in main via {{include}}
└── examples.md # few-shot examples
A simple loader concatenates them in your code.
Production endpoint, one route
For Stack A, the production endpoint is a single route handler that streams the response:
// app/api/generate/route.ts
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import fs from "node:fs";
const SYSTEM = fs.readFileSync("prompts/main.md", "utf8");
export async function POST(req: Request) {
const { input } = await req.json();
const result = streamText({
model: anthropic("claude-sonnet-4-5"),
system: SYSTEM,
prompt: input,
maxTokens: 1024,
});
return result.toTextStreamResponse();
}
That's the whole backend. Add auth and rate-limit middleware next (auth page), but the LLM-call layer is this.
When to add real evals (LLM-as-judge)
For solo v0: don't. String matching catches 80% of regressions, and the eyeball-the-failures step catches the rest.
When you've shipped, have users, and notice quality drift you can't catch with substring matches, then add LLM-as-judge for one or two of your most important rows:
def llm_judge(output: str, criteria: str) -> bool:
prompt = f"Does the following output satisfy this criterion?\n\nCriterion: {criteria}\n\nOutput:\n{output}\n\nAnswer ONLY 'yes' or 'no'."
msg = client.messages.create(
model="claude-sonnet-4-5",
max_tokens=10,
messages=[{"role": "user", "content": prompt}],
)
return msg.content[0].text.strip().lower().startswith("y")
Add a judge_criteria column to eval.csv for the rows that need it. Don't run it on every row — judge calls double your eval cost.
At solo scale, the prompt is the code. Bugs are usually prompt bugs. Refactors are usually prompt refactors. Code review is usually prompt review. Treat prompts/main.md with the same care as your most important source file — small commits, clear messages, do not let a four-paragraph rewrite land as one "improvements" commit.
Common mistakes
- Editing the production prompt directly via the dashboard / runtime. You'll lose history and not be able to bisect a regression. The fix is to always edit
prompts/main.md, commit, deploy. - Running the eval against the production endpoint. It's slow, it costs more (one HTTP hop per call), and it conflates eval bugs with API bugs. The fix is to import the prompt and call the model directly from
eval.py. - Mutating
eval.csvto make the bar move. When a row fails, the temptation is to soften the expected substring. The fix is to fix the prompt or accept the failure as documented limitation — don't move the goalpost. - Skipping streaming "for now." A non-streaming chat UI feels broken at modern user expectations; users hit "stop" or refresh. The fix is
streamTextfrom day one — the Vercel AI SDK makes it cheaper than the non-streaming version. - Not pinning the model string. "I'll just use whatever's latest" → quality moves under you. The fix is a literal
claude-sonnet-4-5(or whichever exact version) in code. Upgrade deliberately, re-run evals, commit the bump.
Page checkpoint
Self-check:
- Do you have
prompts/main.md,eval.csv,eval.py(or.ts), and one production route — and nothing else in your v0 codebase? - Can you go from "prompt edit" → "deployed change" in under 5 minutes?
- Is your model string pinned to an exact version?
What's next
→ Continue to Auth where we'll lock the endpoint down before strangers find it.