Stage 2 — Streaming chatbot

Time budget: ~1 week

In one line: Build the chat interface every "AI app" uses — streamed tokens, persisted history, a working stop button — so you understand the patterns before any framework hides them.

This stage is mostly about plumbing, not models. The model call itself is identical to Stage 1; the work is getting tokens to flow from provider → your server → the browser, keeping conversation state, and handling the UX details that look trivial but aren't (abort, stop button, scroll behavior).

In plain English

A chatbot UI is a list of messages and an input box. The hard parts: tokens should appear as they're generated (not all at once), users should be able to interrupt, history should survive a page refresh, and the back-end should be cheap to run. None of that is the LLM's problem; all of it is yours.

1. Pick a stack

| Path | Stack | Why |
| --- | --- | --- |
| TypeScript/web | Next.js 16 (App Router) + Vercel AI SDK (useChat hook) | One afternoon to working chat; the SDK handles streaming end-to-end |
| Python/web | FastAPI + SSE + plain HTML/JS | Most explicit; you'll see every piece |
| Python/no-frontend | Streamlit or Gradio | Zero frontend code; great for internal tools |

This page shows the TypeScript path in full and sketches the Python path. The conceptual model is identical.

2. The architecture

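Participants: User, Browser, Your Server (route handler), Provider API

1. User → Browser: types message, hits Enter
2. Browser → Your Server: POST /api/chat {messages: [...full history...]}
3. Your Server → Provider API: chat.completions.create({..., stream: true})
4. Provider API → Your Server: SSE chunks (one token at a time)
5. Your Server → Browser: SSE / response stream
6. Browser: append delta to last message (rendering as tokens arrive)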

The full message history is re-sent from the browser to the server every turn. The server passes it to the LLM call, and the response streams back through the server to the browser.
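As a minimal sketch of what "re-sent every turn" means on the client, the snippet below builds the POST body for two consecutive turns. The `ChatMessage` type and the `buildChatRequest` helper are illustrative names, not part of any SDK, and real SDK message types carry more fields.

```typescript
// Illustrative message shape only.
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Hypothetical helper: returns the updated history plus the JSON body for POST /api/chat.
function buildChatRequest(history: ChatMessage[], userText: string) {
  const nextHistory: ChatMessage[] = [...history, { role: "user", content: userText }];
  return {
    history: nextHistory,
    body: JSON.stringify({ messages: nextHistory }), // full history, every turn
  };
}

const turn1 = buildChatRequest([], "What is SSE?");
const afterReply: ChatMessage[] = [
  ...turn1.history,
  { role: "assistant", content: "Server-Sent Events." },
];
const turn2 = buildChatRequest(afterReply, "Show me an example.");
console.log(JSON.parse(turn2.body).messages.length); // 3
```

The second request carries all three messages, which is why request size (and input-token cost) grows with conversation length.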

3. The Next.js + Vercel AI SDK version

npx create-next-app@latest chat-app --typescript --tailwind --app
cd chat-app
npm install ai @ai-sdk/openai zod

Route handler — app/api/chat/route.ts

import { openai } from "@ai-sdk/openai";
import { streamText, convertToModelMessages, type UIMessage } from "ai";

export const maxDuration = 30; // seconds

export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json();

  const result = streamText({
    model: openai("gpt-5-mini"),
    system: "You are a concise assistant. Answer in short paragraphs.",
    messages: convertToModelMessages(messages),
    // Forward the client's abort so Stop also cancels the upstream provider call.
    abortSignal: req.signal,
  });

  return result.toUIMessageStreamResponse();
}

That's the entire backend. Three lines of useful logic.

Frontend — app/page.tsx

"use client";
import { useChat } from "@ai-sdk/react";
import { DefaultChatTransport } from "ai";
import { useState } from "react";

export default function Chat() {
  const [input, setInput] = useState("");
  const { messages, sendMessage, status, stop } = useChat({
    transport: new DefaultChatTransport({ api: "/api/chat" }),
  });
  // "submitted" covers the gap between Send and the first token.
  const busy = status === "submitted" || status === "streaming";

  return (
    <div className="mx-auto max-w-xl p-4">
      <ul className="space-y-3 mb-4">
        {messages.map(m => (
          <li key={m.id} className={m.role === "user" ? "text-right" : "text-left"}>
            <span className="inline-block rounded-lg px-3 py-2 bg-slate-100">
              <strong>{m.role}:</strong>{" "}
              {m.parts.map((p, i) =>
                p.type === "text" ? <span key={i}>{p.text}</span> : null
              )}
            </span>
          </li>
        ))}
      </ul>

      <form
        onSubmit={(e) => {
          e.preventDefault();
          if (input.trim()) {
            sendMessage({ text: input });
            setInput("");
          }
        }}
        className="flex gap-2"
      >
        <input
          className="flex-1 rounded border px-3 py-2"
          value={input}
          onChange={(e) => setInput(e.target.value)}
          placeholder="Say something…"
          disabled={busy}
        />
        {busy ? (
          <button type="button" onClick={stop} className="rounded bg-red-500 px-4 text-white">
            Stop
          </button>
        ) : (
          <button type="submit" className="rounded bg-blue-500 px-4 text-white">
            Send
          </button>
        )}
      </form>
    </div>
  );
}

useChat handles the message-list state, the streaming transport, the abort controller for the stop button, and re-renders as tokens arrive. Until you've debugged streaming by hand, it's hard to appreciate what this hook is doing for you.

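Run npm run dev, open localhost:3000, and type a message. You should see tokens appear progressively. Hit Stop mid-stream: the hook aborts the browser's request, and if the route handler forwards the request's abort signal to streamText (abortSignal: req.signal), the upstream call to OpenAI is cancelled too, saving you the output tokens you didn't need.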

4. The Python / FastAPI version (sketch)

# server.py
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI
from pydantic import BaseModel

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    messages: list[dict]

@app.post("/api/chat")
async def chat(req: ChatRequest):
    def event_stream():
        stream = client.chat.completions.create(
            model="gpt-5-mini",
            messages=req.messages,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:
                continue  # skip chunks that carry no choices
            delta = chunk.choices[0].delta.content
            if delta:
                # JSON-encode so newlines in the text can't break SSE framing
                yield f"data: {json.dumps(delta)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

<!-- index.html — tiny SSE consumer -->
<script>
  // Not named `history`: that would collide with the browser's window.history.
  const chatHistory = [];

  async function send() {
    const msg = document.getElementById("input").value;
    chatHistory.push({ role: "user", content: msg });
    const res = await fetch("/api/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ messages: chatHistory }),
    });
    const reader = res.body.getReader();
    const decoder = new TextDecoder();
    let assistant = "";
    let buffer = "";
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters split across reads intact
      buffer += decoder.decode(value, { stream: true });
      // SSE events end with \n\n; keep any incomplete tail for the next read
      const events = buffer.split("\n\n");
      buffer = events.pop();
      for (const event of events) {
        if (event.startsWith("data: ") && event !== "data: [DONE]") {
          assistant += JSON.parse(event.slice(6)); // strip "data: ", decode the delta
          document.getElementById("out").textContent = assistant;
        }
      }
    }
    chatHistory.push({ role: "assistant", content: assistant });
  }
</script>

Less polish, but every line is yours and inspectable.
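One piece worth isolating when you write the consumer by hand is the event parser, because network chunks do not line up with SSE event boundaries. The sketch below shows the buffering technique in TypeScript; `createSseParser` is a hypothetical helper, and it assumes every event is a single `data: ` line, so multi-line events, comments, and `event:`/`id:` fields are not handled.

```typescript
// Minimal incremental parser for "data: ...\n\n" framing.
function createSseParser() {
  let buffer = "";
  return {
    // Feed one decoded network chunk; returns the payload of every complete event.
    push(chunk: string): string[] {
      buffer += chunk;
      const events = buffer.split("\n\n");
      buffer = events.pop() ?? ""; // the last piece may be an incomplete event
      return events
        .filter((e) => e.startsWith("data: "))
        .map((e) => e.slice("data: ".length));
    },
  };
}

const parser = createSseParser();
const received: string[] = [];
// The second event is split across two network chunks.
for (const chunk of ["data: Hel\n\ndata: l", "o\n\ndata: [DONE]\n\n"]) {
  received.push(...parser.push(chunk));
}
console.log(received); // [ 'Hel', 'lo', '[DONE]' ]
```

Splitting each chunk on its own, without the carried-over buffer, would drop or mangle the event that straddles the chunk boundary.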

5. What changed conceptually from Stage 1

| Thing | Stage 1 | Stage 2 |
| --- | --- | --- |
| Caller | Your script | Your browser → your server |
| Message history | Implicit (one call) | Explicit state object, replayed every turn |
| Output mode | Buffered (full string at end) | Streamed (token-by-token) |
| Abort | N/A | User can cancel mid-stream |
| Persistence | None | Belongs in a DB if you want it to survive refresh |

6. The four UX details people forget

Auto-scroll on new tokens, but only if user is at the bottom

// scrollRef is a useRef attached to the scrollable message container
useEffect(() => {
  const el = scrollRef.current;
  if (!el) return;
  const nearBottom = el.scrollHeight - el.scrollTop - el.clientHeight < 100;
  if (nearBottom) el.scrollTop = el.scrollHeight;
}, [messages]);

If the user scrolled up to re-read a previous message, don't yank them back down. Subtle but critical.
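The near-bottom check can be pulled out into a pure function so the threshold is easy to test and tune. This is a sketch: `ScrollMetrics` and `isNearBottom` are illustrative names, and the three numbers are assumed to be read from the scroll container (`scrollHeight`, `scrollTop`, `clientHeight`).

```typescript
interface ScrollMetrics {
  scrollHeight: number; // total height of the scrollable content
  scrollTop: number;    // how far the user has scrolled from the top
  clientHeight: number; // visible height of the container
}

// Hypothetical helper: true when the viewport is within `thresholdPx` of the bottom.
function isNearBottom(m: ScrollMetrics, thresholdPx = 100): boolean {
  const distanceFromBottom = m.scrollHeight - m.scrollTop - m.clientHeight;
  return distanceFromBottom < thresholdPx;
}

// Reading at the bottom: 2000 - 1400 - 600 = 0px from the bottom.
console.log(isNearBottom({ scrollHeight: 2000, scrollTop: 1400, clientHeight: 600 })); // true
// Scrolled up to re-read: 2000 - 300 - 600 = 1100px from the bottom.
console.log(isNearBottom({ scrollHeight: 2000, scrollTop: 300, clientHeight: 600 })); // false
```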

Disable the send button while streaming

Otherwise users double-submit, which racks up tokens for nothing.

A real stop button

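useChat gives you stop(). Wire it to a visible button while a request is in flight. The abort ends the client's stream immediately; it also cancels the upstream API call, saving tokens, as long as your server forwards the request's abort signal to the provider call.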

Show the "thinking" state

There's usually a 200ms–1.5s gap between Send and first token. Without a visible indicator, users hit Send again. A simple indicator or skeleton row is enough.
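The send, stop, and thinking details all hang off the same piece of state, so it can help to derive them in one place. The sketch below assumes the four status values that recent versions of useChat report ("ready", "submitted", "streaming", "error"); `ControlState` and `controlsFor` are hypothetical names.

```typescript
// Assumption: mirrors the status values reported by recent versions of useChat.
type ChatStatus = "ready" | "submitted" | "streaming" | "error";

interface ControlState {
  canSend: boolean;      // Send button and input enabled
  showStop: boolean;     // Stop button visible
  showThinking: boolean; // placeholder row before the first token
}

// Hypothetical helper mapping request status to what the UI should show.
function controlsFor(status: ChatStatus): ControlState {
  const inFlight = status === "submitted" || status === "streaming";
  return {
    canSend: !inFlight,
    showStop: inFlight,
    showThinking: status === "submitted",
  };
}

console.log(controlsFor("submitted"));
// { canSend: false, showStop: true, showThinking: true }
```

Treating "submitted" as in-flight is the part that is easy to miss: a check on "streaming" alone leaves Send enabled during exactly the gap in which users double-submit.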

7. Persistence (when you want history across refreshes)

The simplest version: a single messages table, with a conv_id column grouping rows into conversations.

CREATE TABLE messages (
  id SERIAL PRIMARY KEY,
  conv_id UUID,
  role TEXT NOT NULL,
  content TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

On each turn: append the user message, call the LLM, append the assistant message, return the conv_id back to the client. On page load: fetch messages for the conv_id and render.
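The turn flow described above can be sketched with an in-memory stand-in for the table. Everything here is illustrative: `MessageStore` and `handleTurn` are hypothetical names, a real app would issue SQL instead, and `callModel` is a synchronous placeholder for what is really an asynchronous, streaming LLM call.

```typescript
interface StoredMessage {
  convId: string;
  role: "user" | "assistant";
  content: string;
  createdAt: number;
}

// In-memory stand-in for the messages table.
class MessageStore {
  private rows: StoredMessage[] = [];

  append(convId: string, role: StoredMessage["role"], content: string): void {
    // A monotonic counter stands in for the created_at timestamp.
    this.rows.push({ convId, role, content, createdAt: this.rows.length });
  }

  // Equivalent of: SELECT ... WHERE conv_id = $1 ORDER BY created_at
  load(convId: string): StoredMessage[] {
    return this.rows
      .filter((r) => r.convId === convId)
      .sort((a, b) => a.createdAt - b.createdAt);
  }
}

// One turn: append the user message, call the model, append the reply, return conv_id.
function handleTurn(
  store: MessageStore,
  convId: string,
  userText: string,
  callModel: (history: StoredMessage[]) => string,
): string {
  store.append(convId, "user", userText);
  const reply = callModel(store.load(convId));
  store.append(convId, "assistant", reply);
  return convId;
}

const store = new MessageStore();
handleTurn(store, "conv-1", "Hi", (h) => `You have sent ${h.length} message(s).`);
handleTurn(store, "conv-2", "Unrelated", () => "ok");
console.log(store.load("conv-1").map((m) => `${m.role}: ${m.content}`));
// [ 'user: Hi', 'assistant: You have sent 1 message(s).' ]
```

On page load the client would call the equivalent of `load(convId)` and render the result; messages from other conversations never appear because every query is scoped by conv_id.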

For production persistence you'll also want per-user scoping, soft delete, conversation-level metadata (model, system prompt at the time), and an index on (conv_id, created_at), all of which falls into Lifecycle territory.

8. Bonus: switch providers behind a flag

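// Requires: npm install @ai-sdk/anthropic
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";

// Inside the route handler, where `messages` is the parsed request body:
const model = process.env.PROVIDER === "anthropic"
  ? anthropic("claude-haiku-4-5")
  : openai("gpt-5-mini");

const result = streamText({ model, messages: convertToModelMessages(messages) });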

The Vercel AI SDK abstracts provider differences: same streamText call, different model. (This is what frameworks buy you: provider-swappability with zero rewrite.) In Stage 1 you saw the raw differences; now you can appreciate why the abstraction exists.

Project

Project — A real chat app you'd actually use

Build a chat app and deploy it somewhere — Vercel free tier is fine. Requirements: streaming tokens, conversation persistence (DB or localStorage is OK for now), a working stop button, a model-picker dropdown that swaps between at least two providers, and a tiny info row at the bottom of each assistant turn showing tokens-used and approximate cost. Use it yourself for a week. You'll find five UX bugs nobody else would have caught.
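For the info row, the arithmetic is small enough to sketch. This assumes you already have input and output token counts for the turn from the provider's usage report; the `Pricing` and `Usage` types, the helper names, and especially the prices are made up for illustration, so substitute your provider's current rates.

```typescript
// Hypothetical prices in USD per million tokens.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface Usage {
  inputTokens: number;
  outputTokens: number;
}

function estimateCostUsd(usage: Usage, pricing: Pricing): number {
  return (
    (usage.inputTokens / 1e6) * pricing.inputPerMTok +
    (usage.outputTokens / 1e6) * pricing.outputPerMTok
  );
}

// Text for the info row under an assistant turn.
function formatInfoRow(usage: Usage, pricing: Pricing): string {
  const total = usage.inputTokens + usage.outputTokens;
  return `${total} tokens · ~$${estimateCostUsd(usage, pricing).toFixed(4)}`;
}

const examplePricing: Pricing = { inputPerMTok: 1, outputPerMTok: 4 }; // made-up numbers
console.log(formatInfoRow({ inputTokens: 1200, outputTokens: 300 }, examplePricing));
// 1500 tokens · ~$0.0024
```

Because the full history is re-sent each turn, input tokens usually dominate in long conversations, which is worth seeing in your own info row.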

Common mistakes

Where people commonly trip up
  • Sending only the latest user message. The model has no memory between calls. If you only send the latest message, multi-turn breaks. Re-send the whole history on every call.
  • Forgetting to handle the abort. When the user clicks Stop, you should both cancel the SSE on the client and abort the upstream LLM call on the server. Otherwise you keep getting charged for tokens nobody sees.
  • Wiring auto-scroll naively. Yanking the user back to the bottom every time a token arrives is the worst UX in chat apps. Only auto-scroll if they're already near the bottom.
  • Persisting the system prompt in messages. When you save a conversation to a DB, save the system prompt as conversation metadata (so you know what behavior was in effect), not as a message in the history. Otherwise you'll mix system prompt versions across turns when you change it later.
  • Hitting maxDuration on a serverless function. Streaming a long response on a 10s-limit Lambda silently truncates. Configure maxDuration explicitly (Vercel allows 30–300s on different tiers), or move to a runtime without that limit (Cloudflare Workers, a long-lived Node server).
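The system-prompt point can be made concrete with a small sketch: store the prompt once on the conversation record, and add it only when assembling the model input. The `Conversation` and `ModelMessage` shapes and the `toModelInput` helper are hypothetical, not part of any SDK.

```typescript
interface Conversation {
  id: string;
  systemPrompt: string; // stored once, as conversation metadata
  messages: { role: "user" | "assistant"; content: string }[];
}

type ModelMessage = { role: "system" | "user" | "assistant"; content: string };

// Hypothetical helper: the system prompt is prepended at call time, never persisted as a message.
function toModelInput(conv: Conversation): ModelMessage[] {
  return [{ role: "system", content: conv.systemPrompt }, ...conv.messages];
}

const conv: Conversation = {
  id: "conv-1",
  systemPrompt: "You are a concise assistant.",
  messages: [
    { role: "user", content: "Hi" },
    { role: "assistant", content: "Hello." },
  ],
};

console.log(toModelInput(conv).map((m) => m.role)); // [ 'system', 'user', 'assistant' ]
console.log(conv.messages.length); // 2
```

The stored history stays at two messages however many times the conversation is replayed, so changing the prompt later cannot leave stale copies inside it.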

Next: Stage 3 — Structured output · Back to Part I overview