Stage 2 — Streaming chatbot

Time budget: ~1 week

In one line: Build the chat interface every "AI app" uses — streamed tokens, persisted history, a working stop button — so you understand the patterns before any framework hides them.

This stage is mostly about plumbing, not models. The model call itself is identical to Stage 1; the work is getting tokens to flow from provider → your server → the browser, keeping conversation state, and handling the UX details that look trivial but aren't (abort, stop button, scroll behavior).

In plain English

A chatbot UI is a list of messages and an input box. The hard parts: tokens should appear as they're generated (not all at once), users should be able to interrupt, history should survive a page refresh, and the back-end should be cheap to run. None of that is the LLM's problem; all of it is yours.

1. Pick a stack

Path	Stack	Why
TypeScript/web	Next.js 16 (App Router) + Vercel AI SDK (`useChat` hook)	One afternoon to working chat; the SDK handles streaming end-to-end
Python/web	FastAPI + SSE + plain HTML/JS	Most explicit; you'll see every piece
Python/no-frontend	Streamlit or Gradio	Zero frontend code; great for internal tools

This page shows the TypeScript path in full and sketches the Python path. The conceptual model is identical.

2. The architecture

The browser re-sends the full message history to the server on every turn. The server passes that history to the LLM call, and the response streams back along the same path (provider → your server → the browser).
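A minimal sketch of that replay pattern, assuming a plain { role, content } message shape (the SDK's UIMessage carries more fields). The names ChatMessage, buildTurnRequest and recordReply are illustrative and not part of any library.

```typescript
// Hypothetical message shape for illustration only.
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Append the new user message and return the full history that would be
// sent to the server for this turn. The input array is left untouched.
function buildTurnRequest(
  history: readonly ChatMessage[],
  userText: string,
): { messages: ChatMessage[] } {
  return { messages: [...history, { role: "user", content: userText }] };
}

// Once the stream finishes, the assistant reply joins the history so the
// next turn replays it as well.
function recordReply(messages: readonly ChatMessage[], reply: string): ChatMessage[] {
  return [...messages, { role: "assistant", content: reply }];
}

// Usage: two turns. The second request carries the whole conversation so far.
const turn1 = buildTurnRequest([], "What is SSE?");
const afterTurn1 = recordReply(turn1.messages, "Server-Sent Events.");
const turn2 = buildTurnRequest(afterTurn1, "Show an example.");
```

The request body grows with every turn, which is why long conversations cost more per call than short ones.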

3. The Next.js + Vercel AI SDK version

npx create-next-app@latest chat-app --typescript --tailwind --app
cd chat-app
npm install ai @ai-sdk/react @ai-sdk/openai zod

Route handler — `app/api/chat/route.ts`

import { openai } from "@ai-sdk/openai";
import { streamText, convertToModelMessages, UIMessage } from "ai";

export const maxDuration = 30; // seconds

export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json();

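  const result = streamText({
    model: openai("gpt-5-mini"),
    system: "You are a concise assistant. Answer in short paragraphs.",
    messages: convertToModelMessages(messages),
    // Forward client aborts (e.g. the Stop button) so the upstream call is cancelled too.
    abortSignal: req.signal,
  });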

  return result.toUIMessageStreamResponse();
}

That's the entire backend. Three lines of useful logic.

Frontend — `app/page.tsx`

"use client";
import { useChat } from "@ai-sdk/react";
import { DefaultChatTransport } from "ai";
import { useState } from "react";

export default function Chat() {
  const [input, setInput] = useState("");
  const { messages, sendMessage, status, stop } = useChat({
    transport: new DefaultChatTransport({ api: "/api/chat" }),
  });

  return (
    <div className="mx-auto max-w-xl p-4">
      <ul className="space-y-3 mb-4">
        {messages.map(m => (
          <li key={m.id} className={m.role === "user" ? "text-right" : "text-left"}>
            <span className="inline-block rounded-lg px-3 py-2 bg-slate-100">
              <strong>{m.role}:</strong>{" "}
              {m.parts.map((p, i) =>
                p.type === "text" ? <span key={i}>{p.text}</span> : null
              )}
            </span>
          </li>
        ))}
      </ul>

      <form
        onSubmit={(e) => {
          e.preventDefault();
          if (input.trim()) {
            sendMessage({ text: input });
            setInput("");
          }
        }}
        className="flex gap-2"
      >
        <input
          className="flex-1 rounded border px-3 py-2"
          value={input}
          onChange={(e) => setInput(e.target.value)}
          placeholder="Say something…"
          disabled={status === "submitted" || status === "streaming"}
        />
        {status === "submitted" || status === "streaming" ? (
          <button type="button" onClick={stop} className="rounded bg-red-500 px-4 text-white">
            Stop
          </button>
        ) : (
          <button type="submit" className="rounded bg-blue-500 px-4 text-white">
            Send
          </button>
        )}
      </form>
    </div>
  );
}

useChat handles the message-list state, the streaming transport, the abort controller for the stop button, and re-renders as tokens arrive. Until you've debugged streaming by hand, it's hard to appreciate what this hook is doing for you.

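Run npm run dev, open localhost:3000, and type. You should see tokens appear progressively. Hit Stop mid-stream: the hook aborts the request from the browser. The upstream call to OpenAI is cancelled as well, saving you the output tokens you didn't need, only if the route handler forwards the request's abort signal to streamText (abortSignal: req.signal).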

4. The Python / FastAPI version (sketch)

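# server.py
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from openai import OpenAI

app = FastAPI()
client = OpenAI()  # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    messages: list[dict]  # full history: [{"role": ..., "content": ...}, ...]

@app.post("/api/chat")
async def chat(req: ChatRequest):
    def event_stream():
        stream = client.chat.completions.create(
            model="gpt-5-mini",
            messages=req.messages,
            stream=True,
        )
        for chunk in stream:
            if not chunk.choices:  # some chunks carry no choices
                continue
            delta = chunk.choices[0].delta.content
            if delta:
                # JSON-encode so a newline inside a token can't break SSE framing
                yield f"data: {json.dumps(delta)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

<!-- index.html: tiny SSE consumer -->
<input id="input" /> <button onclick="send()">Send</button>
<pre id="out"></pre>
<script>
// Not named `history`: in a browser that is window.history, which has no push().
const chatHistory = [];

async function send() {
  const msg = document.getElementById("input").value;
  chatHistory.push({ role: "user", content: msg });
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: chatHistory }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let assistant = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across reads
    buffer += decoder.decode(value, { stream: true });
    // SSE events end with a blank line; a read can stop mid-event,
    // so keep the unfinished tail in the buffer for the next read
    const events = buffer.split("\n\n");
    buffer = events.pop();
    for (const event of events) {
      if (!event.startsWith("data: ")) continue;
      const data = event.slice(6); // strip the "data: " prefix
      if (data === "[DONE]") continue;
      assistant += JSON.parse(data);
      document.getElementById("out").textContent = assistant;
    }
  }
  chatHistory.push({ role: "assistant", content: assistant });
}
</script>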

Less polish, but every line is yours and inspectable.
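The fiddly part of consuming SSE by hand is that a network read can end in the middle of an event. Here is a minimal sketch of the buffering step as a pure function, assuming single-line "data: " events separated by a blank line (the framing the server sketch uses). It is not a full SSE parser: multi-line data, event and id fields, and \r\n line endings are ignored. parseSseChunk and SseParseResult are hypothetical names.

```typescript
interface SseParseResult {
  payloads: string[]; // complete `data:` payloads, in arrival order
  rest: string;       // unfinished tail to prepend to the next network chunk
}

function parseSseChunk(buffered: string, chunk: string): SseParseResult {
  const text = buffered + chunk;
  const events = text.split("\n\n");
  // The last piece is "" when the text ended on an event boundary,
  // otherwise it is a partial event that must wait for more data.
  const rest = events.pop() ?? "";
  const payloads = events
    .filter((event) => event.startsWith("data: "))
    .map((event) => event.slice("data: ".length));
  return { payloads, rest };
}

// Usage: an event split across two reads is only emitted once it is complete.
const firstRead = parseSseChunk("", "data: Hel\n\ndata: lo");
const secondRead = parseSseChunk(firstRead.rest, " wor\n\ndata: [DONE]\n\n");
```

Keeping the parser free of DOM and network calls makes it easy to unit-test with hand-written chunks like the ones above.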

5. What changed conceptually from Stage 1

Thing	Stage 1	Stage 2
Caller	Your script	Your browser → your server
Message history	Implicit (one call)	Explicit state object, replayed every turn
Output mode	Buffered (full string at end)	Streamed (token-by-token)
Abort	N/A	User can cancel mid-stream
Persistence	None	Belongs in a DB if you want it to survive refresh

6. The four UX details people forget

Auto-scroll on new tokens, but only if user is at the bottom

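// scrollRef: a useRef<HTMLDivElement>(null) attached to the scrollable message list
useEffect(() => {
  const el = scrollRef.current;
  if (!el) return;
  // Within 100px of the bottom counts as "at the bottom"
  const nearBottom = el.scrollHeight - el.scrollTop - el.clientHeight < 100;
  if (nearBottom) el.scrollTop = el.scrollHeight;
}, [messages]);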

If the user scrolled up to re-read a previous message, don't yank them back down. Subtle but critical.
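The "near the bottom" test is easier to reason about, and to unit-test, when pulled out of the effect. A small sketch, assuming the same three DOM measurements and a 100px threshold; isNearBottom and ScrollMetrics are illustrative names.

```typescript
// The three numbers the browser exposes on any scrollable element.
interface ScrollMetrics {
  scrollHeight: number; // total height of the content
  scrollTop: number;    // how far the content is scrolled from the top
  clientHeight: number; // visible height of the container
}

// True when the viewport's bottom edge is within `thresholdPx` of the end.
function isNearBottom(metrics: ScrollMetrics, thresholdPx = 100): boolean {
  const distanceFromBottom =
    metrics.scrollHeight - metrics.scrollTop - metrics.clientHeight;
  return distanceFromBottom < thresholdPx;
}

// Usage: 50px from the end follows new tokens; 1000px up is left alone.
const followsStream = isNearBottom({ scrollHeight: 2000, scrollTop: 1350, clientHeight: 600 });
const readingOlder = isNearBottom({ scrollHeight: 2000, scrollTop: 400, clientHeight: 600 });
```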

Disable the send button while streaming

Otherwise users double-submit, which racks up tokens for nothing.

A real stop button

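useChat gives you stop(): wire it to a visible button whenever a request is in flight (status is "submitted" or "streaming"). The abort saves tokens only if the server also cancels the upstream API call, for example by passing the request's abort signal to streamText.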
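To see why an abort saves tokens, it helps to model the stream as a lazy source that only produces the next token when asked. The sketch below is a simplified, synchronous model and not the SDK's implementation: AbortLike is a structural subset of the standard AbortSignal, and consumeTokens and upstream are hypothetical names.

```typescript
// Anything with an `aborted` flag works here, including a real AbortSignal.
interface AbortLike {
  readonly aborted: boolean;
}

type StreamOutcome = "finished" | "aborted";

// Pull tokens one at a time and stop pulling as soon as the signal is aborted.
// Leaving the for...of loop early also closes a generator source.
function consumeTokens(
  tokens: Iterable<string>,
  signal: AbortLike,
  onToken: (token: string) => void,
): StreamOutcome {
  if (signal.aborted) return "aborted";
  for (const token of tokens) {
    onToken(token);
    if (signal.aborted) return "aborted";
  }
  return "finished";
}

// Stand-in for the provider: logs every token it had to generate.
function* upstream(log: string[]): Generator<string> {
  for (const token of ["Hel", "lo", " wor", "ld"]) {
    log.push(token);
    yield token;
  }
}

// Usage: the "user" clicks Stop after the second token arrives.
const stopFlag = { aborted: false };
const generated: string[] = [];
const shown: string[] = [];
const outcome = consumeTokens(upstream(generated), stopFlag, (token) => {
  shown.push(token);
  if (shown.length === 2) stopFlag.aborted = true;
});
```

Only two of the four tokens are ever generated in this run. A consumer that ignores the signal would pull all four even though nobody sees the last two.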

Show the "thinking" state

There's usually a 200ms–1.5s gap between Send and first token. Without a visible indicator, users hit Send again. A simple … or skeleton row is enough.
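One way to keep the send button, the stop button and the "thinking" row consistent is to derive all three from the chat status in one place. A sketch, assuming the four status values below mirror what useChat reports in the SDK version this page targets; composerState and ComposerState are hypothetical names.

```typescript
// Assumed status values; check them against the SDK version you install.
type ChatStatus = "ready" | "submitted" | "streaming" | "error";

interface ComposerState {
  canSend: boolean;      // enable the input and the Send button
  showStop: boolean;     // swap Send for Stop
  showThinking: boolean; // show the "…" row before the first token
}

function composerState(status: ChatStatus): ComposerState {
  // "submitted" covers the gap between Send and the first token.
  const busy = status === "submitted" || status === "streaming";
  return {
    canSend: !busy,
    showStop: busy,
    showThinking: status === "submitted",
  };
}

// Usage: the waiting gap blocks double-submits and shows the indicator.
const waiting = composerState("submitted");
const streaming = composerState("streaming");
const idle = composerState("ready");
```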

7. Persistence (when you want history across refreshes)

The simplest version: a single messages table, with a conv_id column grouping rows into conversations.

CREATE TABLE messages (
  id          SERIAL PRIMARY KEY,
  conv_id     UUID NOT NULL,
  role        TEXT NOT NULL,
  content     TEXT NOT NULL,
  created_at  TIMESTAMPTZ DEFAULT now()
);

On each turn: append the user message, call the LLM, append the assistant message, return the conv_id back to the client. On page load: fetch messages for the conv_id and render.

For production persistence you'll also want: per-user scoping, soft delete, conversation-level metadata (model, system prompt at the time), and an index on (conv_id, created_at). All of that falls into Lifecycle territory.
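Before wiring up a real database, the append-and-reload flow can be prototyped in memory. The sketch below is illustrative only: InMemoryChatStore and its methods are hypothetical names, a counter stands in for created_at, and the system prompt is kept as conversation metadata and not as a message row.

```typescript
type StoredRole = "user" | "assistant";

interface StoredMessage {
  convId: string;
  role: StoredRole;
  content: string;
  seq: number; // insertion order, standing in for created_at
}

interface ConversationMeta {
  model: string;
  systemPrompt: string; // kept here, not as a row in the message history
}

class InMemoryChatStore {
  private rows: StoredMessage[] = [];
  private meta = new Map<string, ConversationMeta>();

  createConversation(convId: string, meta: ConversationMeta): void {
    this.meta.set(convId, meta);
  }

  append(convId: string, role: StoredRole, content: string): void {
    this.rows.push({ convId, role, content, seq: this.rows.length });
  }

  // Roughly: SELECT ... FROM messages WHERE conv_id = $1 ORDER BY created_at
  load(convId: string): StoredMessage[] {
    return this.rows
      .filter((row) => row.convId === convId)
      .sort((a, b) => a.seq - b.seq);
  }

  getMeta(convId: string): ConversationMeta | undefined {
    return this.meta.get(convId);
  }
}

// Usage: one turn in conversation "c1", interleaved with an unrelated one.
const store = new InMemoryChatStore();
store.createConversation("c1", {
  model: "gpt-5-mini",
  systemPrompt: "You are a concise assistant.",
});
store.append("c1", "user", "Hi");
store.append("c2", "user", "Unrelated");
store.append("c1", "assistant", "Hello!");
const restored = store.load("c1");
```

Swapping this class for one backed by the messages table keeps the rest of the app unchanged, as long as the same three operations exist.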

8. Bonus: switch providers behind a flag

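import { anthropic } from "@ai-sdk/anthropic"; // npm install @ai-sdk/anthropic
import { openai } from "@ai-sdk/openai";

// Both calls return a model object that streamText accepts.
const model = process.env.PROVIDER === "anthropic"
  ? anthropic("claude-haiku-4-5")
  : openai("gpt-5-mini");

const result = streamText({ model, messages: convertToModelMessages(messages) });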

The Vercel AI SDK abstracts provider differences: same streamText call, different model. (This is what frameworks buy you: provider-swappability with little or no rewrite.) In Stage 1 you saw the raw differences; now you can appreciate why the abstraction exists.
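When the flag comes from a dropdown and not an environment variable, it is safer to resolve it through an allow-list than to pass a client-supplied string straight to a provider. A sketch of that lookup, with the two model IDs taken from this page; MODEL_CHOICES and resolveModelChoice are hypothetical names, and the result would still need to be turned into a model object with the matching provider function.

```typescript
interface ModelChoice {
  provider: "openai" | "anthropic";
  modelId: string;
}

// Keys are the values the UI or env flag is allowed to send.
const MODEL_CHOICES: Record<string, ModelChoice> = {
  openai: { provider: "openai", modelId: "gpt-5-mini" },
  anthropic: { provider: "anthropic", modelId: "claude-haiku-4-5" },
};

const DEFAULT_CHOICE: ModelChoice = MODEL_CHOICES.openai;

// Unknown or missing flags fall back to the default; nothing is thrown.
function resolveModelChoice(flag: string | undefined): ModelChoice {
  if (flag !== undefined && Object.prototype.hasOwnProperty.call(MODEL_CHOICES, flag)) {
    return MODEL_CHOICES[flag];
  }
  return DEFAULT_CHOICE;
}

// Usage: a known flag, a missing flag, and a bogus one.
const picked = resolveModelChoice("anthropic");
const fallback = resolveModelChoice(undefined);
const rejected = resolveModelChoice("not-a-real-provider");
```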

Where to go deeper

Vercel AI SDK docs — every hook, every transport, with live examples.
MDN: Server-Sent Events — the streaming protocol under the hood.
Streaming UX patterns — official OpenAI guidance.

Deeper in this guide

Foundations: Streaming — SSE vs WebSocket vs HTTP/2, when each makes sense.
Patterns: Streaming UX — the production tricks (resumability, partial JSON, etc.).
Stack: LLM SDKs — Vercel AI SDK vs raw provider SDK vs LiteLLM.

Project

Project — A real chat app you'd actually use

Build a chat app and deploy it somewhere — Vercel free tier is fine. Requirements: streaming tokens, conversation persistence (DB or localStorage is OK for now), a working stop button, a model-picker dropdown that swaps between at least two providers, and a tiny info row at the bottom of each assistant turn showing tokens-used and approximate cost. Use it yourself for a week. You'll find five UX bugs nobody else would have caught.
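For the info row, the cost arithmetic is tokens times the per-million-token price, summed over input and output. A sketch with placeholder prices: the numbers below are made up for the example and are not any provider's real rates, so look up current pricing before relying on the output. TokenUsage, PricePerMillion, estimateCostUsd and formatInfoRow are illustrative names.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Prices in USD per one million tokens.
interface PricePerMillion {
  inputUsd: number;
  outputUsd: number;
}

function estimateCostUsd(usage: TokenUsage, price: PricePerMillion): number {
  const inputCost = usage.inputTokens * price.inputUsd;
  const outputCost = usage.outputTokens * price.outputUsd;
  return (inputCost + outputCost) / 1e6;
}

// Text for the small row under each assistant turn.
function formatInfoRow(usage: TokenUsage, price: PricePerMillion): string {
  const totalTokens = usage.inputTokens + usage.outputTokens;
  return `${totalTokens} tokens · ~$${estimateCostUsd(usage, price).toFixed(4)}`;
}

// Usage with PLACEHOLDER prices (hypothetical, not real rates).
const placeholderPrice: PricePerMillion = { inputUsd: 0.5, outputUsd: 2 };
const exampleUsage: TokenUsage = { inputTokens: 1200, outputTokens: 300 };
const exampleCost = estimateCostUsd(exampleUsage, placeholderPrice);
const exampleRow = formatInfoRow(exampleUsage, placeholderPrice);
```

Because the full history is replayed every turn, the input-token count in this row climbs as the conversation gets longer, which is worth watching during your week of use.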

Common mistakes

Where people commonly trip up

Sending only the latest user message. The model has no memory between calls. If you only send the latest message, multi-turn breaks. Re-send the whole history on every call.
Forgetting to handle the abort. When the user clicks Stop, you should both cancel the SSE on the client and abort the upstream LLM call on the server. Otherwise you keep getting charged for tokens nobody sees.
Wiring auto-scroll naively. Yanking the user back to the bottom every time a token arrives is the worst UX in chat apps. Only auto-scroll if they're already near the bottom.
Persisting the system prompt in messages. When you save a conversation to a DB, save the system prompt as conversation metadata (so you know what behavior was in effect), not as a message in the history. Otherwise you'll mix system prompt versions across turns when you change it later.
Hitting maxDuration on a serverless function. Streaming a long response on a 10s-limit Lambda silently truncates the response. Configure maxDuration explicitly (Vercel allows 30–300s on different tiers; check the current limits for your plan), or move to a runtime without that limit (Cloudflare Workers, a long-lived Node server).

→ Next: Stage 3 — Structured output · Back to Part I overview
