Agentic RAG & memory
In one line: Classic RAG retrieves once and generates; agentic RAG lets the model decide when to search again, reformulate queries, and combine sources across steps — closer to how a researcher actually works.
One-shot RAG is like googling once and writing the essay. Agentic RAG is like a researcher who searches, reads, realizes the first query missed the point, searches again with better keywords, and only then answers. It costs more latency and tokens, but it wins on hard questions where a single retrieval pass returns the wrong chunks.
One-shot RAG vs. agentic RAG
| | One-shot RAG | Agentic RAG |
|---|---|---|
| Retrieval | Fixed pipeline: embed query → top-k → generate | Model chooses when and how to retrieve |
| Query | User question as-is | Model may rewrite, decompose, or multi-query |
| Failure mode | Wrong chunks → confident hallucination | Can retry retrieval before answering |
| Cost / latency | Lower | Higher (multiple search + LLM turns) |
| Best for | FAQ, support with clean docs | Research, legal, multi-hop factual tasks |
Foundations: RAG basics, hybrid search, reranking. Agentic RAG wraps those primitives in a loop — see agent loop.
Patterns that show up in production
Query decomposition — Break "What is the revenue impact of policy X in region Y?" into sub-queries (policy text, regional sales, comparable cases). Each sub-query gets its own retrieval pass, and the harness merges the results.
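A minimal sketch of decomposition and merging, assuming a toy in-memory corpus and a word-overlap retriever. `CORPUS`, `retrieve`, `decompose`, and `merge` are hypothetical names for illustration. In a real system `decompose` would be an LLM call and `retrieve` would hit your vector or hybrid index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    score: float


# Toy corpus standing in for a real index (hypothetical data).
CORPUS = {
    "policy-x": "Policy X text: discount cap of 10 percent",
    "sales-y": "Region Y regional sales by quarter",
    "case-z": "Comparable cases: policy rollout in Region Z",
}


def retrieve(query: str, k: int = 2) -> list[Chunk]:
    """Stand-in retriever: scores documents by word overlap with the query."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, text in CORPUS.items():
        words = set(text.lower().replace(":", "").split())
        overlap = len(terms & words)
        if overlap:
            hits.append(Chunk(doc_id, text, float(overlap)))
    return sorted(hits, key=lambda c: c.score, reverse=True)[:k]


def decompose(question: str) -> list[str]:
    """Placeholder for an LLM call that splits the question into sub-queries."""
    return ["policy x text", "region y regional sales", "comparable cases"]


def merge(result_sets: list[list[Chunk]]) -> list[Chunk]:
    """Deduplicate by doc_id, keeping the best score seen for each document."""
    best: dict[str, Chunk] = {}
    for chunks in result_sets:
        for chunk in chunks:
            if chunk.doc_id not in best or chunk.score > best[chunk.doc_id].score:
                best[chunk.doc_id] = chunk
    return sorted(best.values(), key=lambda c: c.score, reverse=True)


def decomposed_retrieval(question: str) -> list[Chunk]:
    """One retrieval pass per sub-query, then a single merged result list."""
    return merge([retrieve(q) for q in decompose(question)])
```

Deduplicating by document and keeping the best score is only one possible merge policy. Rank fusion or a reranker over the merged set are common alternatives.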
Self-critique before answer — After retrieval, the model asks: do these passages support a complete answer? If not, search again with a refined query. This is reflection applied to RAG — related to planning and reflection.
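The critique-then-retry idea can be sketched as a bounded loop. This assumes the search backend and the two model calls (sufficiency check and query refinement) are passed in as plain functions. All names here are hypothetical, and the toy stand-ins exist only to make the sketch runnable.

```python
from typing import Callable


def retrieve_with_critique(
    question: str,
    search: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    refine: Callable[[str, list[str]], str],
    max_passes: int = 3,
) -> tuple[list[str], int]:
    """Search, critique the evidence, and retry with a refined query.

    Returns the collected passages and the number of passes used. The
    max_passes budget guards against runaway searching.
    """
    query = question
    passages: list[str] = []
    for pass_number in range(1, max_passes + 1):
        for passage in search(query):
            if passage not in passages:  # keep evidence from earlier passes
                passages.append(passage)
        if is_sufficient(question, passages):
            return passages, pass_number
        query = refine(question, passages)
    return passages, max_passes


# Toy stand-ins for the index and the two model calls (all hypothetical).
INDEX = {
    "refund": ["Refunds are issued within 14 days."],
    "refund policy exceptions": ["Digital goods are non-refundable."],
}


def toy_search(query: str) -> list[str]:
    return INDEX.get(query, [])


def toy_is_sufficient(question: str, passages: list[str]) -> bool:
    return any("non-refundable" in p for p in passages)


def toy_refine(question: str, passages: list[str]) -> str:
    return "refund policy exceptions"


passages, passes = retrieve_with_critique(
    "refund", toy_search, toy_is_sufficient, toy_refine
)
print(passes, passages)
```

The hard cap on passes is a design choice worth keeping even in a sketch: without it, a critic that is never satisfied turns into an unbounded search loop.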
Tool-shaped retrieval — Expose search as a tool (search_docs, search_web, search_codebase) instead of silently prepending chunks. The model learns when retrieval helps vs. when it can answer from context.
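One way to expose retrieval as tools is a small registry plus a dispatcher, sketched below. The tool names come from the text above. The stub backends, the `TOOL_SPECS` layout (a generic JSON-Schema-style description, not any particular vendor's format), and the `dispatch` helper are assumptions for illustration.

```python
import json
from typing import Callable


# Stub backends; a real harness would call the actual search services.
def search_docs(query: str) -> list[str]:
    return [f"docs hit for {query!r}"]


def search_codebase(query: str) -> list[str]:
    return [f"code hit for {query!r}"]


TOOLS: dict[str, Callable[[str], list[str]]] = {
    "search_docs": search_docs,
    "search_codebase": search_codebase,
}

# Descriptions the model would see when deciding whether to retrieve.
TOOL_SPECS = [
    {
        "name": name,
        "description": f"Run {name} and return matching passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
    for name in TOOLS
]


def dispatch(tool_call_json: str) -> list[str]:
    """Execute one model-issued tool call and return its passages.

    Errors are returned as text so the model can see them and correct
    its next call instead of crashing the loop.
    """
    call = json.loads(tool_call_json)
    name = call.get("name")
    if name not in TOOLS:
        return [f"error: unknown tool {name!r}"]
    query = call.get("arguments", {}).get("query")
    if not isinstance(query, str) or not query.strip():
        return ["error: 'query' must be a non-empty string"]
    return TOOLS[name](query)
```

For example, `dispatch('{"name": "search_docs", "arguments": {"query": "acl"}}')` routes to the `search_docs` stub, while an unknown tool name comes back as an error string the model can read.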
Graph and structured hops — Some systems combine vector search with knowledge graphs or SQL: retrieve entities, follow edges, then vector-search related passages. The agentic layer decides which hop comes next.
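A rough sketch of entity hops, assuming a toy adjacency map and passages keyed by entity. `EDGES`, `PASSAGES`, and `hop_retrieve` are hypothetical, and the dictionary lookup stands in for the vector search a real system would run per entity.

```python
# Hypothetical toy data: an entity graph and passages keyed by entity.
EDGES = {
    "Policy X": ["Region Y", "Finance Team"],
    "Region Y": ["Q3 Sales Report"],
}
PASSAGES = {
    "Policy X": ["Policy X caps discounts."],
    "Region Y": ["Region Y adopted Policy X in Q3."],
    "Q3 Sales Report": ["Q3 sales in Region Y fell after the cap."],
}


def hop_retrieve(start: str, max_hops: int = 2) -> list[str]:
    """Breadth-first walk over entity edges, collecting passages per entity.

    A real agentic layer would let the model choose which edge to follow;
    here every edge is followed, up to max_hops, for simplicity.
    """
    seen = {start}
    frontier = [start]
    collected = list(PASSAGES.get(start, []))
    for _ in range(max_hops):
        next_frontier = []
        for entity in frontier:
            for neighbor in EDGES.get(entity, []):
                if neighbor in seen:
                    continue  # avoid revisiting entities in cyclic graphs
                seen.add(neighbor)
                next_frontier.append(neighbor)
                collected.extend(PASSAGES.get(neighbor, []))
        frontier = next_frontier
    return collected
```

Bounding the walk with `max_hops` plays the same role as a search budget: it keeps a densely connected graph from pulling in the whole corpus.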
Memory alongside retrieval
Retrieval pulls external knowledge; memory holds session and user state. Agentic systems combine both:
- Episodic — what happened in this task (tool results, prior sub-answers)
- Semantic — distilled facts worth reusing (user prefs, project glossary)
- Working — scratchpad the harness maintains outside the raw chat log
The harness must decide what to write to memory and what to re-retrieve each turn. For example, re-fetch docs that may have changed, but cache stable user prefs in memory.
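The three memory types and the write-versus-re-retrieve decision can be sketched as follows. `AgentMemory` and `DocCache` are hypothetical names, and the time-to-live rule is only one simple staleness policy, assumed here for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentMemory:
    episodic: list[dict] = field(default_factory=list)        # events in this task
    semantic: dict[str, str] = field(default_factory=dict)    # reusable facts
    working: dict[str, object] = field(default_factory=dict)  # harness scratchpad

    def record_step(self, tool: str, result: str) -> None:
        self.episodic.append({"tool": tool, "result": result})

    def remember_fact(self, key: str, value: str) -> None:
        self.semantic[key] = value


@dataclass
class DocCache:
    """Caches fetched docs but re-fetches once an entry reaches ttl_seconds."""

    ttl_seconds: float
    _entries: dict[str, tuple[float, str]] = field(default_factory=dict)

    def get(
        self,
        doc_id: str,
        fetch: Callable[[str], str],
        now: Optional[float] = None,
    ) -> str:
        # `now` is injectable so the staleness rule can be tested without sleeping.
        now = time.monotonic() if now is None else now
        cached = self._entries.get(doc_id)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]
        text = fetch(doc_id)
        self._entries[doc_id] = (now, text)
        return text
```

Stable facts such as user prefs go through `remember_fact` once, while documents go through `DocCache.get`, so anything that may have changed is fetched again after the TTL expires.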
If your eval shows one-shot RAG already hits faithfulness targets, agent loops add cost and failure modes (runaway searches, tool spam) without benefit. Check first whether retrieval or generation is the bottleneck; see eval types.
Eval implications
Faithfulness metrics on the final answer are not enough. You also want:
- Retrieval recall per step — did the right doc appear in any pass?
- Search efficiency — how many passes before a correct answer?
- Citation coverage — does every factual claim trace to a retrieved chunk?
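The three measurements above could be computed roughly as follows, assuming each retrieval pass is logged as a list of doc ids and each claim carries the ids it cites. The function names and data shapes are hypothetical, and `passes_until_covered` is only a proxy for search efficiency, since it counts passes until the gold docs appear rather than until the answer is judged correct.

```python
from typing import Optional


def retrieval_recall_per_step(passes: list[list[str]], gold: set[str]) -> list[float]:
    """Fraction of gold doc ids found in each individual retrieval pass."""
    if not gold:
        raise ValueError("gold must contain at least one doc id")
    return [len(gold & set(p)) / len(gold) for p in passes]


def recall_any_pass(passes: list[list[str]], gold: set[str]) -> float:
    """Fraction of gold doc ids that appeared in at least one pass."""
    if not gold:
        raise ValueError("gold must contain at least one doc id")
    seen: set[str] = set()
    for p in passes:
        seen.update(p)
    return len(gold & seen) / len(gold)


def passes_until_covered(passes: list[list[str]], gold: set[str]) -> Optional[int]:
    """1-based index of the first pass after which all gold docs were seen.

    Returns None if the gold docs were never fully retrieved.
    """
    seen: set[str] = set()
    for i, p in enumerate(passes, start=1):
        seen.update(p)
        if gold <= seen:
            return i
    return None


def citation_coverage(claims: list[dict], retrieved: set[str]) -> float:
    """Share of claims citing at least one chunk that was actually retrieved."""
    if not claims:
        return 1.0
    supported = sum(1 for c in claims if set(c["citations"]) & retrieved)
    return supported / len(claims)
```

Note that `citation_coverage` only checks that a citation points at a retrieved chunk. Whether the chunk actually supports the claim still needs a faithfulness judge.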
Harvey and Glean (see case studies) differ in domain, but both treat multi-step retrieval and ACL-aware search as core requirements rather than optional extras.
→ Next: Trajectory & process evals