Workflow comparison

In one line: Solo: edit, push, watch the dashboard. Startup: PRD → v0 → evals → cohort → 100%, two weeks. Enterprise: PRD → risk tier → security review → eval bar → POC → pilot → rollout, 1–3 months and 8–20 people.

In plain English

The same change moves through very different choreography at each scale. A prompt tweak that takes a solo dev 90 seconds takes a startup 30 minutes and an enterprise 1–2 weeks. Almost none of that difference is the work itself; it is time spent waiting for approvals, eval runs, deploy windows, and rollout soaks.

The choreography exists for a reason: it absorbs risk that the lower scales don't carry. The mistake is keeping a step after the risk it was designed to absorb is gone, or skipping a step while you still carry the risk it absorbs.

Idea → live (a new AI feature)

| Stage | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| Spec | A sentence in your head | One-page PRD | Full PRD + risk-tier classification + data-flow diagram |
| Approach | Whatever Claude suggests | Quick design discussion + eval sketch | Architecture review, prompt/RAG/fine-tune decision documented |
| v0 | A weekend hack | 3–5 day sprint behind a flag | Internal POC, 2–4 weeks |
| Eval bar | "Looks good on 5 prompts" | Pass on 200-case eval suite | Pass on 5,000-case battery + bias + safety + red-team |
| Rollout | Push to main | Cohort rollout: 5% → 25% → 100% over a week | Pilot tenant → expanded pilot → general availability, over months |
| People involved | 1 | 3–5 | 8–20 |
| Calendar time | A weekend | 2 weeks | 1–3 months |

Prompt change

| Step | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| Author | Edit file in IDE | Open PR with diff | Open PR + propose registry entry update |
| Review | None | One teammate | Tech lead + (for High tier) AI safety partner |
| Eval | Run Promptfoo locally | CI runs eval suite, blocks on regression | CI + extended battery + bias eval + nightly soak in staging |
| Deploy | Push to main → live in 60 sec | Merge → auto-deploy behind flag → cohort rollout | Merge → next deploy window → canary 1% → 10% → 100% |
| Audit | git log | git log + eval-platform diff | Prompt registry version + audit log + change record |
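
The Audit row comes down to being able to say exactly which prompt text was live at a given time. As a rough illustration, here is a minimal sketch of a content-addressed registry entry. The `PromptRegistry` and `PromptVersion` names, the in-memory log, and the 12-character ID are assumptions made for this example, not the API of any particular registry product.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PromptVersion:
    name: str
    text: str
    author: str
    created_at: str
    version_id: str  # first 12 hex chars of SHA-256 over the prompt text


@dataclass
class PromptRegistry:
    """Hypothetical in-memory registry; a real one would persist its entries."""
    _log: list = field(default_factory=list)

    def register(self, name: str, text: str, author: str) -> PromptVersion:
        # The ID depends only on the text, so identical prompts share an ID.
        version_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        entry = PromptVersion(
            name=name,
            text=text,
            author=author,
            created_at=datetime.now(timezone.utc).isoformat(),
            version_id=version_id,
        )
        self._log.append(entry)  # append-only: history is never rewritten
        return entry

    def history(self, name: str) -> list:
        """All recorded versions of one prompt, oldest first."""
        return [e for e in self._log if e.name == name]
```

At solo scale, `git log` already provides the same information; a structure like this starts to matter once the audit trail has to be read by people who do not work in the repository.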

Provider change (e.g. swap primary model)

| Step | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| Triggering decision | Read a tweet about a new model | Eval suite shows new model is better/cheaper | Cost forecast or strategic vendor decision |
| Vetting | None | Run full eval suite on candidate | Vendor security review + DPIA + legal + procurement (3–9 months for a new vendor) |
| Mechanism | Edit one env var | Change one line in gateway config | Update gateway routing rule via change management |
| Rollout | Immediate | Shadow traffic for a day → cohort cutover | Shadow → pilot tenants → expanded → 100%, with rollback runbook |
| Time | An evening | 1–2 weeks | 1–2 quarters |
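
The "shadow traffic" step in the Rollout row can be sketched as a wrapper that always answers from the primary model and calls the candidate only for comparison. This is a simplified illustration: `make_shadow_router` is a hypothetical helper, models are assumed to be plain functions from prompt to text, and the exact-match comparison stands in for whatever scoring a real eval platform would apply. A production version would likely also run the candidate call off the request path so it adds no latency.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def make_shadow_router(
    primary: Model,
    candidate: Model,
    mismatches: List[Tuple[str, str, str]],
) -> Model:
    """Serve from `primary`; call `candidate` in the shadow and log differences."""

    def route(prompt: str) -> str:
        answer = primary(prompt)  # users only ever see this answer
        try:
            shadow = candidate(prompt)
            if shadow != answer:
                mismatches.append((prompt, answer, shadow))
        except Exception as exc:
            # A failing candidate must never break the user-facing path.
            mismatches.append((prompt, answer, f"candidate error: {exc!r}"))
        return answer

    return route
```

After a day of traffic, the `mismatches` list is the raw material for deciding whether the cohort cutover is safe.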

Incident response

| Stage | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| Detection | A tweet, your dashboard, or "oh no" | Synthetic eval / alert / customer report | SIEM correlation, automated SLO breach, multiple signals |
| Triage | You, immediately | On-call engineer opens incident channel | Incident commander assigned, severity declared |
| Mitigation | Flip env var, redeploy | Flip kill switch in Statsig | Auto-flip switches by cohort, manual confirm |
| Comms | A tweet | Status page + customer email | Status page + customer emails + executive brief + (sometimes) regulatory notification |
| Post-mortem | None | One-pager in Notion | Formal blameless report, action items tracked to closure, board summary if SEV1 |

Highlight: the cohort rollout is the most copy-able startup practice

If a startup adopts exactly one enterprise-style practice early, it should be the cohort rollout — pushing changes to 5% → 25% → 100% with a kill switch ready.

It costs almost nothing (a feature flag and a metric to watch), catches the majority of "looked great in eval, looks awful in prod" regressions, and gives you a clean rollback story. It's the highest-leverage process import from enterprise to startup; everything else (RFCs, risk tiers, registries) is far more expensive per unit of risk reduced.
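
One common way to implement a 5% → 25% → 100% rollout is to hash each user into a stable bucket and compare it with the current percentage. The sketch below assumes that approach; the `bucket` and `is_enabled` helpers are hypothetical names for this example, and a hosted flag service would normally do this for you.

```python
import hashlib


def bucket(user_id: str, flag: str) -> float:
    """Map (flag, user) to a stable number in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash, reduced to steps of 0.01%.
    return (int(digest[:8], 16) % 10_000) / 100.0


def is_enabled(user_id: str, flag: str, rollout_percent: float, killed: bool = False) -> bool:
    """True if this user is inside the current cohort and the kill switch is off."""
    if killed:
        return False  # the kill switch overrides any rollout percentage
    return bucket(user_id, flag) < rollout_percent
```

Because a user's bucket never changes, everyone in the 5% cohort is still in the 25% cohort, so expanding the rollout never flips the feature off for someone who already has it. Including the flag name in the hash keeps different features from always landing on the same early users.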

Worked example: shipping a "summarize my emails" feature, three orgs
  • Solo: Friday night, weekend project. Hack a Next.js page + Vercel AI SDK + Claude Sonnet + a prompt. Eyeball it on 10 emails of your own. Deploy to a .app domain. Tweet about it Sunday night. Total: a weekend. Stakeholders: 1.
  • Startup: PM writes a one-page PRD on Monday. AI engineer builds v0 by Wednesday, behind a summarize_email flag. Eval suite (200 cases curated from real user inboxes, with PII redacted) runs in CI. Cohort rollout to 5% of beta users on Friday; expand to 25% Monday; 100% next Wednesday after dashboards stay green. Total: 2 weeks. Stakeholders: PM + AI engineer + reviewer + designer + customer-success lead = 5.
  • Enterprise: PRD goes to product council in Q1. Risk-tier classification: Medium (touches user data). Security review (3 weeks) confirms email content can flow to the approved Bedrock endpoint. Eval bar requires passing the 5,000-case enterprise eval battery + a bias eval + a red-team session. POC built by the AI feature team over 6 weeks. Pilot to one internal team (2 weeks). Expanded pilot to 3 customers (4 weeks). GA in Q2. Total: ~4 months. Stakeholders: PM + 2 AI engineers + AI safety partner + security partner + legal + privacy + designer + product council + 3 pilot customers = 15+.

Same feature. Three different blast radii. Three appropriately-sized processes.

What stays the same / what changes

Stays the same: at every scale you go from idea → eval → deploy → watch. Every column has some kill switch. Every column has some version of "ship it behind a flag."

Changes: how many people are in the room at each step, how long each step takes, how much documentation each step produces, and how big a hole in the world a screw-up makes.

Eval-bar evolution by scale

The bar for what counts as a passing eval is the most visible part of the development loop that tightens from one column to the next.

| Dimension | Solo | Startup | Enterprise |
| --- | --- | --- | --- |
| Number of eval cases | 5–20 | 100–500 | 2,000–20,000 |
| Where they come from | The author's intuition | Curated from real user prompts | Curated + adversarial + bias-targeted + red-team |
| Pass criteria | "Looks good to me" | Pass rate ≥ baseline on suite | Pass rate + no regression on safety + no bias delta beyond threshold |
| When they run | Manually before pushing | On every PR + nightly drift | Pre-merge + nightly + pre-release + post-incident + continuous on live traffic sample |
| Who owns the suite | The author | The AI team collectively | A dedicated eval / quality function |
| Eval refresh cadence | Whenever | Monthly | Continuous + post-incident additions |

The pattern: the eval suite grows roughly with the blast radius. Solo doesn't need 5,000 cases because the cost of being wrong is one annoyed user. Enterprise can't ship with 200 cases because the cost of being wrong is a regulatory filing.
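
The Pass criteria row translates fairly directly into a merge gate. The following is a minimal sketch under stated assumptions: `EvalReport` and `gate` are hypothetical names, `bias_delta` is assumed to be a single scalar produced by the eval harness, and the 0.02 threshold is a placeholder rather than a recommended value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalReport:
    passed: int
    total: int
    safety_failures: int = 0  # failed cases in the safety subset
    bias_delta: float = 0.0   # assumed scalar: score gap between groups

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def gate(
    candidate: EvalReport,
    baseline: EvalReport,
    strict: bool = False,
    max_bias_delta: float = 0.02,  # placeholder threshold
) -> List[str]:
    """Return the reasons to block a merge; an empty list means pass."""
    reasons = []
    # Startup bar: pass rate must not drop below the baseline.
    if candidate.pass_rate < baseline.pass_rate:
        reasons.append("pass rate below baseline")
    if strict:
        # Enterprise bar adds the safety and bias conditions.
        if candidate.safety_failures > baseline.safety_failures:
            reasons.append("safety regression")
        if abs(candidate.bias_delta) > max_bias_delta:
            reasons.append("bias delta beyond threshold")
    return reasons
```

In CI, a non-empty result would typically map to a failing check. The solo bar of "Looks good to me" has no equivalent here, which is the point of that column.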

Common mistakes

  • Solo + cohort rollout. Setting up a 5%/25%/100% rollout for a feature only you and three friends use is procrastination dressed up as engineering. You don't have enough users to read the signal — ship to 100%, watch the dashboard, roll back if needed.
  • Startup + enterprise PRD. A 20-page PRD with risk-tier classification and a security section for a 2-week feature is the surest way to ship in 2 months instead. Use a one-pager, ship faster, learn faster.
  • Enterprise + "just push it." A staff engineer at a bank who skips the change-management process for a "tiny prompt tweak" is one screenshot away from a regulator's letter. The process exists for a reason; route around it only with senior cover and a written exception.
  • No kill switch at any scale. Even solo. The only acceptable answer to "what do you do if the model starts saying something it shouldn't?" is "flip this thing." Build it before you need it.
  • Confusing a feature flag with a kill switch. A flag that nobody on-call knows how to flip in production at 3am is not a kill switch — it's a config option. Make it explicit, document it, and drill it.
  • Treating the eval suite as static. An eval suite that doesn't grow after every real-world failure is decaying — the bugs you've already shipped will keep recurring. Every post-incident review should add at least one eval case.
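
To make the kill-switch items above concrete, here is a minimal sketch of a switch that routes around the model entirely. The variable name `AI_SUMMARY_KILLED`, the `summarize` wrapper, and the truncation fallback are all assumptions for illustration; a startup would more likely read the switch from its flag service than from the environment.

```python
import os
from typing import Callable, Mapping

KILL_SWITCH_ENV = "AI_SUMMARY_KILLED"  # hypothetical variable name


def kill_switch_on(env: Mapping[str, str] = os.environ) -> bool:
    """True if the switch is set to a recognizably 'on' value."""
    return env.get(KILL_SWITCH_ENV, "").strip().lower() in {"1", "true", "on", "yes"}


def summarize(
    email_body: str,
    model_call: Callable[[str], str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Summarize via the model unless the kill switch is on."""
    if kill_switch_on(env):
        # Fallback path makes no model call at all: show a plain excerpt.
        return email_body[:200]
    return model_call(email_body)
```

The switch is checked on every call, so it takes effect as soon as the value changes. On platforms where changing an environment variable requires a redeploy, flipping it still costs one deploy, which matches the Solo column above. Whatever the mechanism, the name of the switch and the steps to flip it belong in the on-call runbook.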

→ Next: Economics comparison.