RAG basics

In one line: RAG = "before you answer, look stuff up." Retrieve the documents most relevant to the user's question, paste them into the prompt, ask the model to answer using them. Citeable, updateable, cheap.

In plain English

A bare LLM is like a smart student taking a test from memory — sometimes right, often confidently wrong on specifics. RAG is letting that student open a textbook during the test. Find the right pages, hand them over, ask the question. The student still has to read and reason, but now they're answering from real source material instead of guessing.

The minimal pipeline

User asks a question.
Retrieve the top K (typically 3–10) most relevant chunks from your knowledge base via vector / hybrid search.
Rerank the candidates (see reranking).
Prompt the LLM: "Answer the user's question using ONLY the context below. If the answer isn't in the context, say so." Then paste the chunks.
Generate.
Cite. Have the model emit which chunks it used; render those as sources in the UI.

That's it. The first version of every successful RAG app is about 200 lines of code.

Worked example: a 50-line RAG endpoint

SYSTEM = """You are a documentation assistant.
Answer ONLY from the provided context.
If the answer isn't there, say "I don't know from the docs."
Cite source chunks using [chunk_id] notation."""

def rag_answer(question: str) -> dict:
    # 1. Retrieve
    candidates = hybrid_search(question, k=30)
    # 2. Rerank
    top = co.rerank(model="rerank-english-v3.0", query=question,
                    documents=[c["text"] for c in candidates], top_n=5)
    chunks = [candidates[r.index] for r in top.results]
    
    # 3. Build context
    context = "\n\n".join(
        f"[{c['id']}] {c['text']}" for c in chunks
    )
    
    # 4. Generate
    resp = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return {
        "answer": resp.choices[0].message.content,
        "sources": [{"id": c["id"], "text": c["text"][:200]} for c in chunks],
    }

That's a working RAG endpoint. Add caching, rate limits, and observability and you're production.

Why RAG instead of fine-tuning

Updateable. New docs land tomorrow; RAG sees them tomorrow. Fine-tuning takes days and a rebuild.
Citeable. Users (and auditors) need to see where the answer came from. Fine-tuned models can't tell you.
Cheaper. No training run, no model hosting if you use a provider API.
Reversible. Bad retrieval is a config tweak. Bad fine-tune is a rollback.
Composable. Different users see different docs (multi-tenant), and the model never "leaks" doc A's content into doc B's tenant.

Fine-tuning still wins for: tone/style at scale, structured output the model can't already produce, and ultra-low-latency narrow tasks.

Where RAG goes wrong

Bad chunking. Most RAG quality problems are retrieval problems, and most retrieval problems are chunking problems. See chunking strategies.
No hybrid search. Pure vector loses on rare-term queries (SKUs, names, error codes). See hybrid search.
No re-ranking. Cheap retriever pulls 50 candidates → cross-encoder reranker picks the top 5. Big quality jump. See reranking.
No evals. You ship, things look fine in spot checks, then 20% of answers are subtly wrong. Build evals before you ship.
Citations that lie. Always make the model cite source IDs from the provided chunks; never accept invented URLs.
"Answer from context" prompt that the model ignores. Stronger phrasing helps; structured output (return an answer field and a cited_chunk_ids: list[str] field) helps more.

The 2026 RAG checklist

Worked example: enforcing citations with structured output

class RagAnswer(BaseModel):
    answer: str
    cited_chunk_ids: list[str]
    confidence: Literal["high", "medium", "low"]

resp = client.beta.chat.completions.parse(
    model="gpt-5-mini",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
    response_format=RagAnswer,
    temperature=0,
)
result = resp.choices[0].message.parsed

# Verify citations actually exist
chunk_ids_in_context = {c["id"] for c in chunks}
result.cited_chunk_ids = [c for c in result.cited_chunk_ids if c in chunk_ids_in_context]

The cited_chunk_ids field guarantees the model thinks about which sources it used. The post-check guarantees they're real.

Generation vs retrieval as separate quality questions

Two failure modes, two fixes:

Retrieval failed → improve chunking, hybrid, reranker.
Retrieval succeeded but generation failed → improve prompt, switch to a stronger model.

Don't conflate them. Always measure both separately. A retrieval recall@5 of 90% with a generation accuracy of 70% means you ship at 63% — better than either component's score suggests.

Variants you'll meet

Conversational RAG — multi-turn chat, where you also have to handle "what about that one?" follow-ups. Rewrite the user's query to be standalone before searching.
Multi-hop RAG — questions that need to chain multiple lookups. Often better solved as an agent with a search tool, not as one giant retrieval.
GraphRAG — build a knowledge graph from your docs, query it for entities, then retrieve. Heavy lift; useful when entity relationships matter (legal, biomedical).
Self-RAG / corrective RAG — model checks its own answer and re-retrieves if confidence is low. Useful when accuracy > latency.

What beginners get wrong

Common mistakes

Treating RAG as one box. It's four steps (chunk, embed/index, retrieve+rerank, generate). Every step has its own quality knobs.
Skipping evals. No measurement = no improvement. Spot checks lie.
Pasting raw chunks without IDs. Then the model can't cite, and you can't verify.
Letting the model answer when retrieval returned nothing relevant. It will confabulate. Detect zero/low-confidence retrieval and short-circuit to "I don't know."
Re-running retrieval on every conversational turn with the same query. Cache results; rewrite the query when it's a follow-up.
Treating chunk count as more = better. Past ~10 chunks, the model gets confused or "loses" some. Top-5 with reranking beats top-20 raw.
Forgetting freshness. If docs update, you need a re-index pipeline. Many RAGs ship answering questions from 6-month-old indexes.

Highlight: RAG is the default LLM architecture in 2026

If you're building an LLM feature on top of your own data — internal docs, knowledge bases, support archives, codebases — RAG is the starting point. Long context, fine-tuning, and agents are all alternatives you reach for after RAG has been measured and found insufficient.

→ Going deeper: For the production pipeline — hybrid search, reranking, metadata/tenant filtering, citation validation, and splitting retrieval vs generation evals — see The RAG pattern in production.

🤔 Quick checkQuick check

→ Next: Memory

The minimal pipeline
Worked example: a 50-line RAG endpoint
Why RAG instead of fine-tuning
Where RAG goes wrong
The 2026 RAG checklist
Worked example: enforcing citations with structured output
Generation vs retrieval as separate quality questions
Variants you'll meet
What beginners get wrong

The minimal pipeline​

Worked example: a 50-line RAG endpoint​

Why RAG instead of fine-tuning​

Where RAG goes wrong​

The 2026 RAG checklist​

Worked example: enforcing citations with structured output​

Generation vs retrieval as separate quality questions​

Variants you'll meet​

What beginners get wrong​