Agentic RAG & memory

In one line: Classic RAG retrieves once and generates; agentic RAG lets the model decide when to search again, reformulate queries, and combine sources across steps — closer to how a researcher actually works.

In plain English

One-shot RAG is like googling once and writing the essay. Agentic RAG is like a researcher who searches, reads, realizes the first query missed the point, searches again with better keywords, and only then answers. It costs more latency and tokens, but it wins on hard questions where a single retrieval pass returns the wrong chunks.

One-shot RAG vs. agentic RAG

Aspect	One-shot RAG	Agentic RAG
Retrieval	Fixed pipeline: embed query → top-k → generate	Model chooses when and how to retrieve
Query	User question as-is	Model may rewrite, decompose, or multi-query
Failure mode	Wrong chunks → confident hallucination	Can retry retrieval before answering
Cost / latency	Lower	Higher (multiple search + LLM turns)
Best for	FAQ, support with clean docs	Research, legal, multi-hop factual tasks

Foundations: see RAG basics, hybrid search, and reranking. Agentic RAG wraps those primitives in a loop (see agent loop).
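To make the contrast concrete, here is a minimal sketch of both shapes. The toy corpus, the keyword `retrieve` function, and the `rewrite` callback (standing in for a model call that proposes a new query) are all assumptions for illustration, not part of any specific framework.

```python
from typing import Callable, Optional

# Toy corpus and keyword retriever; stand-ins for a real index.
CORPUS = {
    "doc1": "refund policy: refunds are issued within 30 days",
    "doc2": "shipping times: orders ship in 2 business days",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank docs by how many query words they contain; drop zero-score docs."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0][:k]

def one_shot(query: str) -> list[str]:
    """Fixed pipeline: retrieve once, then hand the chunks to generation."""
    return retrieve(query)

def agentic(
    query: str,
    rewrite: Callable[[str, int], Optional[str]],
    max_steps: int = 3,
) -> list[str]:
    """Loop: retrieve, and if nothing came back, let the model rewrite the query."""
    found: list[str] = []
    current: Optional[str] = query
    for step in range(max_steps):
        if current is None:
            break  # the model gave up rewriting
        found = retrieve(current)
        if found:
            break  # good enough for this sketch; a real loop would judge quality
        current = rewrite(current, step)
    return found

# Hypothetical rewrite: a real system would ask the model for better keywords.
print(one_shot("money back"))
print(agentic("money back", lambda q, step: "refunds issued"))
```

The `max_steps` cap is what bounds the extra latency and token cost mentioned in the table.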

Patterns that show up in production

Query decomposition: Break "What is the revenue impact of policy X in region Y?" into sub-queries (policy text, regional sales, comparable cases). Each sub-query gets its own retrieval pass; the harness merges results.
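A rough sketch of the decompose, retrieve, merge flow. `decompose` stands in for a model call, and the index contents, chunk ids, and scores are made up for illustration; merging by best score per chunk is one simple choice among several.

```python
def decompose(question: str) -> list[str]:
    """Hypothetical stand-in for an LLM call that splits a question."""
    return ["policy X text", "region Y sales", "comparable cases policy X"]

def retrieve(sub_query: str) -> list[tuple[str, float]]:
    """Hypothetical retriever returning (chunk_id, score) pairs."""
    fake_index = {
        "policy X text": [("policy_x.md#1", 0.9), ("policy_x.md#2", 0.7)],
        "region Y sales": [("sales_y.csv#q3", 0.8)],
        "comparable cases policy X": [("policy_x.md#1", 0.6), ("case_z.md#4", 0.5)],
    }
    return fake_index.get(sub_query, [])

def merge(results: list[list[tuple[str, float]]], k: int = 5) -> list[str]:
    """Deduplicate by chunk id, keep each chunk's best score, return the top k."""
    best: dict[str, float] = {}
    for hits in results:
        for chunk_id, score in hits:
            best[chunk_id] = max(score, best.get(chunk_id, 0.0))
    return sorted(best, key=lambda c: best[c], reverse=True)[:k]

sub_queries = decompose("What is the revenue impact of policy X in region Y?")
context = merge([retrieve(q) for q in sub_queries])
print(context)
```

Note that `policy_x.md#1` is returned by two sub-queries but appears once in the merged context.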

Self-critique before answer — After retrieval, the model asks: do these passages support a complete answer? If not, search again with a refined query. This is reflection applied to RAG — related to planning and reflection.
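The critique step can be sketched as a loop around retrieval. Both callables here are hypothetical stand-ins for model calls: `critique` is assumed to return a pair of (is the evidence sufficient, refined query).

```python
from typing import Callable

def answer_with_critique(
    question: str,
    retrieve: Callable[[str], list[str]],
    critique: Callable[[str, list[str]], tuple[bool, str]],
    max_passes: int = 3,
) -> tuple[list[str], int]:
    """Retrieve, ask the critic whether the passages suffice, refine if not.

    Returns the accumulated passages and the number of passes used.
    """
    passages: list[str] = []
    query = question
    for n in range(1, max_passes + 1):
        for p in retrieve(query):
            if p not in passages:
                passages.append(p)
        sufficient, query = critique(question, passages)
        if sufficient:
            return passages, n
    return passages, max_passes  # budget exhausted; caller may answer with caveats

# Toy stand-ins: the first query misses the notice-period passage "p2".
def toy_retrieve(q: str) -> list[str]:
    index = {
        "termination clause": ["p1"],
        "termination clause notice period": ["p1", "p2"],
    }
    return index.get(q, [])

def toy_critique(question: str, passages: list[str]) -> tuple[bool, str]:
    if "p2" in passages:
        return True, question
    return False, "termination clause notice period"

passages, passes = answer_with_critique("termination clause", toy_retrieve, toy_critique)
print(passages, passes)
```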

Tool-shaped retrieval — Expose search as a tool (search_docs, search_web, search_codebase) instead of silently prepending chunks. The model learns when retrieval helps vs. when it can answer from context.
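One way to shape search as tools: a registry, a set of specs the harness shows the model, and a dispatcher for the calls the model emits. The spec layout (name, description, JSON-Schema-style parameters) and the call format are assumptions here; the exact shape depends on your model provider.

```python
import json

def search_docs(query: str) -> list[str]:
    return [f"docs hit for {query!r}"]  # placeholder for a real docs index

def search_codebase(query: str) -> list[str]:
    return [f"code hit for {query!r}"]  # placeholder for a real code index

TOOLS = {
    "search_docs": {"fn": search_docs, "description": "Search internal documentation."},
    "search_codebase": {"fn": search_codebase, "description": "Search source code."},
}

def tool_specs() -> list[dict]:
    """Descriptions the harness would pass to the model alongside the prompt."""
    return [
        {
            "name": name,
            "description": tool["description"],
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        }
        for name, tool in TOOLS.items()
    ]

def dispatch(tool_call_json: str) -> list[str]:
    """Run a model-emitted call like {"name": ..., "arguments": {"query": ...}}."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        # Return the error as a tool result so the model can recover.
        return [f"error: unknown tool {call['name']}"]
    return tool["fn"](call["arguments"]["query"])

print(dispatch('{"name": "search_docs", "arguments": {"query": "sso setup"}}'))
```

Because the model has to emit a call to get chunks, the trajectory records when it chose to search and when it answered from context.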

Graph and structured hops — Some systems combine vector search with knowledge graphs or SQL: retrieve entities, follow edges, then vector-search related passages. The agentic layer decides which hop comes next.
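A toy sketch of hop selection over a graph. The entities, edges, and passages are invented for illustration, and a dictionary lookup stands in for the vector search a real system would run at each node; `choose_next` is a hypothetical stand-in for the agentic layer.

```python
from typing import Callable, Optional

# Hypothetical toy graph: entity -> neighbouring entities.
EDGES = {
    "Policy X": ["Region Y", "Regulation Q"],
    "Region Y": ["Sales Report 2023"],
    "Regulation Q": [],
    "Sales Report 2023": [],
}
# Hypothetical passage store keyed by entity.
PASSAGES = {
    "Policy X": "Policy X caps discounts at 10%.",
    "Region Y": "Region Y is the largest market by volume.",
    "Sales Report 2023": "Region Y sales fell 4% after the cap.",
}

def walk(
    start: str,
    choose_next: Callable[[str, list[str]], Optional[str]],
    max_hops: int = 3,
) -> list[str]:
    """Follow edges picked by `choose_next`, collecting passages along the path."""
    collected: list[str] = []
    visited: set[str] = set()
    node: Optional[str] = start
    for _ in range(max_hops + 1):
        if node is None or node in visited:
            break
        visited.add(node)
        if node in PASSAGES:
            collected.append(PASSAGES[node])
        candidates = [n for n in EDGES.get(node, []) if n not in visited]
        node = choose_next(node, candidates) if candidates else None
    return collected

# Trivial policy: always take the first unvisited neighbour.
path_passages = walk("Policy X", lambda node, candidates: candidates[0])
print(path_passages)
```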

Memory alongside retrieval

Retrieval pulls external knowledge; memory holds session and user state. Agentic systems combine both:

Episodic — what happened in this task (tool results, prior sub-answers)
Semantic — distilled facts worth reusing (user prefs, project glossary)
Working — scratchpad the harness maintains outside the raw chat log

The harness must decide what to write to memory vs. what to re-retrieve each turn. A reasonable default: re-fetch docs that may have changed, and cache stable user prefs in memory.
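A minimal sketch of the three memory stores next to a doc cache that re-fetches after a time-to-live. The class names and the TTL policy are assumptions for illustration; a real system might key freshness on document versions or change events instead.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Memory:
    episodic: list[str] = field(default_factory=list)       # tool results, prior sub-answers
    semantic: dict[str, str] = field(default_factory=dict)  # stable facts: user prefs, glossary
    working: dict[str, str] = field(default_factory=dict)   # harness scratchpad

@dataclass
class DocCache:
    """Re-fetch a doc once its cached copy is older than `ttl_seconds`."""
    ttl_seconds: float
    _store: dict[str, tuple[float, str]] = field(default_factory=dict)

    def get(self, doc_id: str, fetch: Callable[[str], str], now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        cached = self._store.get(doc_id)
        if cached is not None and now - cached[0] < self.ttl_seconds:
            return cached[1]  # still fresh: skip the retrieval call
        text = fetch(doc_id)
        self._store[doc_id] = (now, text)
        return text

memory = Memory()
memory.semantic["answer_language"] = "en"  # stable pref: written once, reused every turn

fetch_calls: list[str] = []
def fetch(doc_id: str) -> str:
    fetch_calls.append(doc_id)
    return f"contents of {doc_id}"

cache = DocCache(ttl_seconds=60)
cache.get("pricing.md", fetch, now=0)   # miss: fetches
cache.get("pricing.md", fetch, now=30)  # hit: served from cache
cache.get("pricing.md", fetch, now=61)  # stale: fetches again
memory.episodic.append("fetched pricing.md twice this task")
print(len(fetch_calls))
```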

When not to use agentic RAG

If your eval shows one-shot RAG already hits faithfulness targets, adding agent loops adds cost and failure modes (runaway searches, tool spam) without benefit. Run evals first (see eval types) to find out whether retrieval or generation is the bottleneck.

Eval implications

Faithfulness metrics on the final answer are not enough. You also want:

Retrieval recall per step — did the right doc appear in any pass?
Search efficiency — how many passes before a correct answer?
Citation coverage — does every factual claim trace to a retrieved chunk?
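The three metrics above can be computed from a logged trajectory. The `Trajectory` fields below are an assumed logging shape, and the definitions are one plausible reading of each metric rather than standard formulas.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trajectory:
    retrieved_per_step: list[list[str]]  # doc ids returned by each retrieval pass
    gold_docs: set[str]                  # docs a correct answer needs
    claims: list[tuple[str, list[str]]]  # (claim text, cited doc ids)
    correct: bool                        # was the final answer judged correct?

def retrieval_recall(t: Trajectory) -> float:
    """Fraction of gold docs that appeared in any pass."""
    seen = set().union(*t.retrieved_per_step)
    return len(seen & t.gold_docs) / len(t.gold_docs) if t.gold_docs else 1.0

def passes_to_correct(t: Trajectory) -> Optional[int]:
    """Number of retrieval passes used, or None if the answer was wrong."""
    return len(t.retrieved_per_step) if t.correct else None

def citation_coverage(t: Trajectory) -> float:
    """Fraction of claims whose citations all point at retrieved docs."""
    if not t.claims:
        return 1.0
    retrieved = set().union(*t.retrieved_per_step)
    supported = sum(1 for _, cites in t.claims if cites and set(cites) <= retrieved)
    return supported / len(t.claims)

t = Trajectory(
    retrieved_per_step=[["d1"], ["d2", "d3"]],
    gold_docs={"d2", "d4"},
    claims=[("a", ["d2"]), ("b", []), ("c", ["d9"]), ("d", ["d1", "d3"])],
    correct=True,
)
print(retrieval_recall(t), passes_to_correct(t), citation_coverage(t))
```

In this example, claim "b" has no citation and claim "c" cites a doc that was never retrieved, so both count against coverage.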

Harvey and Glean (see case studies) differ in domain, but both treat multi-step retrieval and ACL-aware search as core — not optional frosting.

→ Next: Trajectory & process evals
