Agentic RAG & memory

In one line: Classic RAG retrieves once and generates; agentic RAG lets the model decide when to search again, reformulate queries, and combine sources across steps — closer to how a researcher actually works.

In plain English

One-shot RAG is like googling once and writing the essay. Agentic RAG is like a researcher who searches, reads, realizes the first query missed the point, searches again with better keywords, and only then answers. It costs more latency and tokens, but it wins on hard questions where a single retrieval pass returns the wrong chunks.

One-shot RAG vs. agentic RAG

| Dimension      | One-shot RAG                                    | Agentic RAG                                 |
|----------------|-------------------------------------------------|---------------------------------------------|
| Retrieval      | Fixed pipeline: embed query → top-k → generate  | Model chooses when and how to retrieve      |
| Query          | User question as-is                             | Model may rewrite, decompose, or multi-query |
| Failure mode   | Wrong chunks → confident hallucination          | Can retry retrieval before answering        |
| Cost / latency | Lower                                           | Higher (multiple search + LLM turns)        |
| Best for       | FAQ, support with clean docs                    | Research, legal, multi-hop factual tasks    |

Foundations: RAG basics, hybrid search, reranking. Agentic RAG wraps those primitives in a loop — see agent loop.

[Sequence diagram. Participants: User, Agent, Retriever, Generator. Flow: Complex question → Plan sub-questions → Search pass 1 → Chunks → Enough evidence? → Search pass 2 (reformulated) → More chunks → Answer with citations → Grounded response]
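
The loop above (search, check the evidence, reformulate, search again, then answer) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `agentic_answer`, the four callables it takes, and the toy corpus are all hypothetical names chosen for this example. In a real system `enough` and `reformulate` would typically be LLM calls and `search` would hit a retriever.

```python
from typing import Callable, Dict, List

def agentic_answer(
    question: str,
    search: Callable[[str], List[str]],
    enough: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    max_passes: int = 3,
) -> Dict[str, object]:
    """Retrieve, check evidence, retry with a new query if needed, then answer."""
    query = question
    evidence: List[str] = []
    queries: List[str] = []
    for _ in range(max_passes):          # hard cap guards against runaway searches
        queries.append(query)
        for chunk in search(query):
            if chunk not in evidence:    # skip chunks already collected
                evidence.append(chunk)
        if enough(question, evidence):
            break
        query = reformulate(question, evidence)
    return {
        "answer": generate(question, evidence),
        "queries": queries,
        "evidence": evidence,
    }

# Toy stand-ins (assumptions for the demo only).
CORPUS = {
    "refund policy": ["Refunds are issued within 14 days."],
    "refund policy exceptions": ["Digital goods are non-refundable."],
}

def toy_search(query: str) -> List[str]:
    return CORPUS.get(query, [])

def toy_enough(question: str, evidence: List[str]) -> bool:
    return len(evidence) >= 2

def toy_reformulate(question: str, evidence: List[str]) -> str:
    return "refund policy exceptions"

def toy_generate(question: str, evidence: List[str]) -> str:
    return " ".join(evidence)

result = agentic_answer(
    "refund policy", toy_search, toy_enough, toy_reformulate, toy_generate
)
print(result["queries"])
```

The `max_passes` cap is the important design choice here: without it, a model that never judges the evidence sufficient would keep searching indefinitely.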

Patterns that show up in production

Query decomposition — Break "What is the revenue impact of policy X in region Y?" into sub-queries (policy text, regional sales, comparable cases). Each sub-query gets its own retrieval pass; the harness merges results.
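
One way the harness-side merge might look is sketched below. The sub-queries are hard-coded here (in practice the model would produce them), and `retrieve_decomposed`, `TOY_INDEX`, and the scores are hypothetical. The sketch also assumes scores are comparable across sub-queries, which is not guaranteed with real retrievers; a reranker or rank-based fusion may be a better fit there.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (doc_id, score)

def retrieve_decomposed(
    sub_queries: List[str],
    search: Callable[[str], List[Hit]],
    top_k: int = 5,
) -> List[Tuple[str, float, str]]:
    """Run one retrieval pass per sub-query, then merge.

    A doc returned by several sub-queries is kept once, with its best score
    and the sub-query that produced that score.
    """
    best: Dict[str, Tuple[float, str]] = {}
    for sub_query in sub_queries:
        for doc_id, score in search(sub_query):
            if doc_id not in best or score > best[doc_id][0]:
                best[doc_id] = (score, sub_query)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(doc_id, score, sq) for doc_id, (score, sq) in ranked[:top_k]]

# Toy index standing in for a real retriever (assumption for the demo).
TOY_INDEX: Dict[str, List[Hit]] = {
    "policy X text": [("policy_x.pdf", 0.91), ("faq.md", 0.40)],
    "region Y sales": [("sales_y.csv", 0.88), ("faq.md", 0.55)],
    "comparable cases": [("case_study.md", 0.73)],
}

def toy_search(sub_query: str) -> List[Hit]:
    return TOY_INDEX.get(sub_query, [])

merged = retrieve_decomposed(
    ["policy X text", "region Y sales", "comparable cases"], toy_search
)
for doc_id, score, source_query in merged:
    print(f"{score:.2f}  {doc_id}  (from: {source_query})")
```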

Self-critique before answer — After retrieval, the model asks: do these passages support a complete answer? If not, search again with a refined query. This is reflection applied to RAG — related to planning and reflection.

Tool-shaped retrieval — Expose search as a tool (search_docs, search_web, search_codebase) instead of silently prepending chunks. The model learns when retrieval helps vs. when it can answer from context.
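
A rough sketch of what "search as a tool" can mean on the harness side follows. The registry, the `tool` decorator, the call format (`{"name": ..., "arguments": {...}}`), and the toy `search_docs` body are assumptions made for illustration; they do not reflect any particular vendor's tool-calling API.

```python
from typing import Any, Callable, Dict, List

# Registry the harness would advertise to the model (name + description),
# and use to execute whatever tool call the model emits.
TOOLS: Dict[str, Dict[str, Any]] = {}

def tool(name: str, description: str) -> Callable:
    """Register a function as a callable tool under the given name."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return register

@tool("search_docs", "Search internal documentation for passages relevant to a query.")
def search_docs(query: str) -> List[str]:
    # Toy keyword match standing in for a real retriever.
    docs = ["Agentic RAG retries retrieval.", "One-shot RAG retrieves once."]
    words = query.lower().split()
    return [d for d in docs if any(w in d.lower() for w in words)]

def dispatch(call: Dict[str, Any]) -> Dict[str, Any]:
    """Execute a model-issued tool call; report unknown tools instead of raising."""
    entry = TOOLS.get(call["name"])
    if entry is None:
        return {"ok": False, "error": f"unknown tool: {call['name']}"}
    return {"ok": True, "result": entry["fn"](**call["arguments"])}

print(dispatch({"name": "search_docs", "arguments": {"query": "agentic"}}))
```

Because retrieval only runs when the model emits a call, a question answerable from context costs no search at all, which is the behavior the pattern is after.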

Graph and structured hops — Some systems combine vector search with knowledge graphs or SQL: retrieve entities, follow edges, then vector-search related passages. The agentic layer decides which hop comes next.
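
As a toy illustration of the "follow edges, then search" idea: the graph, the passages, and `hop_then_search` below are made up for this example, and the per-entity dictionary lookup merely stands in for a vector search scoped to that entity. The hop count is fixed here, whereas an agentic layer would choose the next hop itself.

```python
from typing import Dict, List

# Hypothetical knowledge graph: entity -> neighboring entities.
GRAPH: Dict[str, List[str]] = {
    "policy X": ["region Y", "regulator Z"],
    "region Y": ["sales team"],
}

# Hypothetical passages keyed by entity (stand-in for entity-scoped vector search).
PASSAGES: Dict[str, List[str]] = {
    "policy X": ["Policy X caps discounts at 10%."],
    "region Y": ["Region Y revenue fell 4% last quarter."],
    "regulator Z": ["Regulator Z reviews pricing rules annually."],
}

def hop_then_search(seed: str, hops: int = 1) -> List[str]:
    """Expand the seed entity by `hops` graph steps, then gather passages."""
    seen: List[str] = [seed]
    frontier: List[str] = [seed]
    for _ in range(hops):
        next_frontier: List[str] = []
        for entity in frontier:
            for neighbor in GRAPH.get(entity, []):
                if neighbor not in seen:
                    seen.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    passages: List[str] = []
    for entity in seen:
        passages.extend(PASSAGES.get(entity, []))
    return passages

print(hop_then_search("policy X", hops=1))
```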

Memory alongside retrieval

Retrieval pulls external knowledge; memory holds session and user state. Agentic systems combine both:

  • Episodic — what happened in this task (tool results, prior sub-answers)
  • Semantic — distilled facts worth reusing (user prefs, project glossary)
  • Working — scratchpad the harness maintains outside the raw chat log

The harness must decide what to write to memory and what to re-retrieve each turn. For example, re-fetch docs that may have changed, but cache stable user prefs in memory.
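
The split between memory and re-retrieval could be sketched as below. `AgentMemory`, its field names, and the 300-second freshness window are illustrative assumptions, not a prescribed design: semantic facts are kept indefinitely, while documents are re-fetched once their cached copy is older than `max_age_s`.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

@dataclass
class AgentMemory:
    episodic: List[str] = field(default_factory=list)       # what happened in this task
    semantic: Dict[str, str] = field(default_factory=dict)  # distilled, reusable facts
    working: Dict[str, str] = field(default_factory=dict)   # harness scratchpad
    doc_cache: Dict[str, Tuple[float, str]] = field(default_factory=dict)

    def get_doc(
        self,
        doc_id: str,
        fetch: Callable[[str], str],
        max_age_s: float = 300.0,
        now: Optional[float] = None,
    ) -> str:
        """Return a cached doc if it is fresh enough, otherwise re-fetch it."""
        now = time.time() if now is None else now
        cached = self.doc_cache.get(doc_id)
        if cached is not None and now - cached[0] <= max_age_s:
            return cached[1]
        text = fetch(doc_id)
        self.doc_cache[doc_id] = (now, text)
        self.episodic.append(f"fetched {doc_id}")  # log the tool result as an episode
        return text

memory = AgentMemory()
memory.semantic["tone"] = "concise"  # stable user pref: cached, never re-fetched

fetch_log: List[str] = []

def toy_fetch(doc_id: str) -> str:
    fetch_log.append(doc_id)
    return f"{doc_id} v{len(fetch_log)}"

first = memory.get_doc("handbook", toy_fetch, now=0.0)     # cache miss: fetches
second = memory.get_doc("handbook", toy_fetch, now=100.0)  # still fresh: cached
third = memory.get_doc("handbook", toy_fetch, now=400.0)   # stale: re-fetches
```

Passing `now` explicitly is only there to make the demo deterministic; in normal use the method falls back to the wall clock.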

When not to use agentic RAG

If your evals show one-shot RAG already hits faithfulness targets, agent loops add cost and failure modes (runaway searches, tool spam) without benefit. Check eval types first to find out whether retrieval or generation is the bottleneck.

Eval implications

Faithfulness metrics on the final answer are not enough. You also want:

  • Retrieval recall per step — did the right doc appear in any pass?
  • Search efficiency — how many passes before a correct answer?
  • Citation coverage — does every factual claim trace to a retrieved chunk?
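
The three measures above could be computed from a logged trajectory roughly as follows. The function names and the trajectory shape (a list of doc-id lists per pass, plus claims carrying citation ids) are assumptions for this sketch. Note that `passes_until_covered` is only a proxy for search efficiency: it counts passes until all gold docs have been retrieved, not until the answer is judged correct.

```python
from typing import Dict, List, Optional, Set

def recall_any_pass(passes: List[List[str]], gold: Set[str]) -> float:
    """Fraction of gold doc ids that appeared in at least one retrieval pass."""
    if not gold:
        raise ValueError("gold set must not be empty")
    seen = {doc_id for retrieved in passes for doc_id in retrieved}
    return len(gold & seen) / len(gold)

def passes_until_covered(passes: List[List[str]], gold: Set[str]) -> Optional[int]:
    """1-based index of the pass after which all gold docs were seen, else None."""
    seen: Set[str] = set()
    for index, retrieved in enumerate(passes, start=1):
        seen.update(retrieved)
        if gold <= seen:
            return index
    return None

def citation_coverage(claims: List[Dict[str, List[str]]], retrieved: Set[str]) -> float:
    """Share of claims that cite at least one chunk that was actually retrieved."""
    if not claims:
        return 1.0  # no claims means no unsupported claims
    supported = sum(
        1 for claim in claims if any(c in retrieved for c in claim["citations"])
    )
    return supported / len(claims)

# Toy trajectory (made-up ids).
passes = [["d1", "d9"], ["d2"]]
gold = {"d1", "d2"}
claims = [
    {"citations": ["d1"]},   # supported
    {"citations": []},       # no citation
    {"citations": ["d7"]},   # cites a chunk that was never retrieved
]
retrieved_ids = {doc_id for p in passes for doc_id in p}

print(recall_any_pass(passes, gold))
print(passes_until_covered(passes, gold))
print(citation_coverage(claims, retrieved_ids))
```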

Harvey and Glean (see case studies) differ in domain, but both treat multi-step retrieval and ACL-aware search as core — not optional frosting.


→ Next: Trajectory & process evals
