Naive RAG Is Dead. Here's What Replaced It.
Most RAG pipelines retrieve garbage, stuff it into context, and pray. Agentic RAG replaces the prayer with a judge, a retry loop, and a routing layer that actually works.
You built a RAG pipeline. You chunked your documents, embedded them with text-embedding-3-large, stored them in Pinecone, and wired it all into your chatbot. The demo looked great. Your stakeholders were impressed.
Then real users showed up.
"What's our refund policy for enterprise contracts?" Your pipeline retrieves three chunks about consumer refunds, one chunk about enterprise pricing, and zero chunks about enterprise refund policies. The LLM, ever helpful, synthesizes a confident answer from the wrong context. Your user follows it. Legal calls you.
This is not a chunking problem. It is not an embedding model problem. It is an architecture problem. Naive RAG — retrieve once, generate once — is fundamentally broken for anything beyond toy demos. The industry has moved on. Here's what replaced it.
The Three Failures of Naive RAG
Every naive RAG pipeline fails in the same three ways:
1. Retrieval Misses
Embedding similarity is not semantic relevance. The query "enterprise contract refund policy" and the chunk "Enterprise agreements are subject to the terms outlined in Appendix B of the Master Service Agreement" have low cosine similarity despite being exactly what the user needs. The chunk uses none of the query's keywords. The embedding model has never seen your company's jargon.
2. Context Poisoning
Even when the right chunk is retrieved, it often arrives alongside irrelevant chunks that dilute or contradict it. The LLM cannot reliably distinguish "this chunk answers the question" from "this chunk is topically related but irrelevant." When you stuff five chunks into context and only one is relevant, you are asking the model to find a needle in a haystack you constructed.
3. No Feedback Loop
Naive RAG is open-loop. Retrieve, generate, return. There is no mechanism to detect that the retrieval was bad, that the generation hallucinated, or that the answer does not actually address the question. Every other engineering discipline has feedback loops. Naive RAG has a prayer.
The Retrieval Accuracy Problem
In benchmarks, top-5 retrieval accuracy for real-world enterprise corpora hovers between 40-65%. That means 35-60% of the time, the correct answer is not in the context window at all. No amount of prompt engineering fixes a missing context problem.
Agentic RAG: The Architecture
Agentic RAG replaces the linear retrieve-generate pipeline with an agent loop that can reason about its own retrieval quality and take corrective action.
┌──────────────┐
│ User Query │
└──────┬───────┘
│
┌──────▼───────┐
│ Query Router │──── Simple lookup? ──► Direct retrieval
└──────┬───────┘
│ Complex query
┌──────▼───────────┐
│ Query Decomposer │
│ (break into │
│ sub-queries) │
└──────┬────────────┘
│
┌──────▼───────┐
│ Retriever │◄──── Retry with
│ (hybrid │ rewritten query
│ search) │ ▲
└──────┬───────┘ │
│ │
┌──────▼───────┐ │
│ Retrieval │── Bad ───┘
│ Judge │
└──────┬───────┘
│ Good
┌──────▼───────┐
│ Generator │
└──────┬───────┘
│
┌──────▼───────┐
│ Answer │── Unsupported ──► Retry generation
│ Grounding │ or re-retrieve
│ Check │
└──────┬───────┘
│ Grounded
┌──────▼───────┐
│ Response │
└──────────────┘
Five components. Each one addresses a specific failure mode of naive RAG. Let's build them.
Component 1: Query Router
Not every query needs the full agentic pipeline. Simple factual lookups can go through fast retrieval. Complex, multi-part, or ambiguous queries need decomposition and judging.
import anthropic
import json
client = anthropic.Anthropic()
async def route_query(query: str) -> str:
"""Classify a query as simple or complex."""
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=100,
messages=[{
"role": "user",
"content": (
f"Classify this query as SIMPLE or COMPLEX.\n"
f"SIMPLE: single fact, direct lookup, unambiguous.\n"
f"COMPLEX: multi-part, requires reasoning across documents, "
f"ambiguous, or comparative.\n\n"
f"Query: {query}\n\n"
f"Respond with only SIMPLE or COMPLEX."
)
}]
)
classification = response.content[0].text.strip().upper()
return "simple" if "SIMPLE" in classification else "complex"This costs fractions of a cent per query and saves the full pipeline for queries that need it. In production, 40-60% of queries are simple lookups that do not need decomposition or judging.
Component 2: Query Decomposer
Complex queries need to be broken into sub-queries that each target a specific piece of information. This is where agentic RAG diverges most from naive RAG.
async def decompose_query(query: str) -> list[str]:
"""Break a complex query into retrieval-optimized sub-queries."""
response = client.messages.create(
model="claude-sonnet-4-6-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": (
f"Break this query into 2-4 independent sub-queries, each "
f"targeting a specific piece of information needed to answer "
f"the original query. Each sub-query should be self-contained "
f"and optimized for retrieval from a document store.\n\n"
f"Query: {query}\n\n"
f"Return a JSON array of strings. No explanation."
)
}]
)
return json.loads(response.content[0].text)For example, "How does our enterprise refund policy compare to the consumer policy, and when was each last updated?" becomes:
- "Enterprise contract refund policy terms and conditions"
- "Consumer refund policy terms and conditions"
- "Enterprise refund policy last updated date"
- "Consumer refund policy last updated date"
Each sub-query retrieves from a tighter semantic neighborhood than the original, compound query.
Component 3: Hybrid Retrieval
Embeddings alone miss keyword-critical matches. BM25 alone misses semantic matches. Use both.
from dataclasses import dataclass
@dataclass
class RetrievedChunk:
content: str
source: str
score: float
method: str # "semantic" or "keyword"
async def hybrid_retrieve(
query: str,
vector_store,
bm25_index,
top_k: int = 10
) -> list[RetrievedChunk]:
"""Retrieve using both semantic and keyword search, then fuse."""
# Semantic search via embeddings
semantic_results = await vector_store.query(
query=query,
top_k=top_k,
include_metadata=True
)
# Keyword search via BM25
keyword_results = bm25_index.search(query, top_k=top_k)
# Reciprocal Rank Fusion (RRF)
scores: dict[str, float] = {}
chunks: dict[str, RetrievedChunk] = {}
for rank, result in enumerate(semantic_results):
doc_id = result["id"]
scores[doc_id] = scores.get(doc_id, 0) + 1 / (60 + rank)
chunks[doc_id] = RetrievedChunk(
content=result["text"],
source=result["metadata"]["source"],
score=0, # will be updated
method="semantic"
)
for rank, result in enumerate(keyword_results):
doc_id = result.doc_id
scores[doc_id] = scores.get(doc_id, 0) + 1 / (60 + rank)
if doc_id not in chunks:
chunks[doc_id] = RetrievedChunk(
content=result.text,
source=result.source,
score=0,
method="keyword"
)
else:
chunks[doc_id].method = "both"
# Sort by fused score
ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
result = []
for doc_id, score in ranked[:top_k]:
chunk = chunks[doc_id]
chunk.score = score
result.append(chunk)
return resultReciprocal Rank Fusion (RRF) is dead simple: for each retrieval method, assign each result a score of 1 / (k + rank) where k=60 is a constant. Sum the scores across methods. Sort. Documents that appear in both lists get boosted; documents that only one method found still appear.
Why k=60?
The constant k=60 comes from the original RRF paper (Cormack et al., 2009). It controls how much lower-ranked results are penalized. k=60 is robust across most datasets. Do not tune it unless you have a retrieval evaluation set — which you should.
Component 4: Retrieval Judge
This is the critical component that naive RAG lacks entirely. After retrieval, an LLM evaluates whether the retrieved chunks actually contain the information needed to answer the query.
@dataclass
class JudgmentResult:
is_sufficient: bool
relevant_chunks: list[int] # indices of relevant chunks
missing_info: str | None # what's missing, if insufficient
suggested_requery: str | None # rewritten query to try
async def judge_retrieval(
query: str,
chunks: list[RetrievedChunk],
max_retries: int = 2
) -> tuple[list[RetrievedChunk], JudgmentResult]:
"""Judge whether retrieved chunks can answer the query."""
chunks_text = "\n\n".join(
f"[Chunk {i}] (source: {c.source})\n{c.content}"
for i, c in enumerate(chunks)
)
response = client.messages.create(
model="claude-sonnet-4-6-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": (
f"You are a retrieval quality judge. Given a query and "
f"retrieved chunks, determine:\n"
f"1. Which chunks are relevant (by index)\n"
f"2. Whether the relevant chunks contain sufficient "
f"information to answer the query\n"
f"3. If insufficient, what information is missing\n"
f"4. If insufficient, suggest a rewritten query\n\n"
f"Query: {query}\n\n"
f"Retrieved chunks:\n{chunks_text}\n\n"
f"Respond in JSON:\n"
f'{{"is_sufficient": bool, "relevant_chunks": [int], '
f'"missing_info": str|null, "suggested_requery": str|null}}'
)
}]
)
judgment = JudgmentResult(**json.loads(response.content[0].text))
# Filter to only relevant chunks
relevant = [chunks[i] for i in judgment.relevant_chunks if i < len(chunks)]
return relevant, judgmentThe judge serves two purposes:
- Filtering. It removes irrelevant chunks from context before generation, eliminating context poisoning.
- Retry signal. When retrieval is insufficient, it provides a rewritten query — a query specifically designed to find what was missing.
This turns retrieval from open-loop to closed-loop. The system can detect and recover from bad retrieval.
Component 5: Answer Grounding Check
Even with good retrieval, the generator can hallucinate. The grounding check verifies that every claim in the generated answer is supported by the retrieved chunks.
@dataclass
class GroundingResult:
is_grounded: bool
unsupported_claims: list[str]
confidence: float # 0-1
async def check_grounding(
answer: str,
chunks: list[RetrievedChunk],
query: str
) -> GroundingResult:
"""Verify that the answer is grounded in the retrieved chunks."""
context = "\n\n".join(c.content for c in chunks)
response = client.messages.create(
model="claude-sonnet-4-6-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": (
f"You are a grounding checker. Verify that every factual "
f"claim in the answer is directly supported by the provided "
f"context. Flag any claims that are not supported.\n\n"
f"Question: {query}\n\n"
f"Context:\n{context}\n\n"
f"Answer:\n{answer}\n\n"
f"Respond in JSON:\n"
f'{{"is_grounded": bool, "unsupported_claims": [str], '
f'"confidence": float}}'
)
}]
)
return GroundingResult(**json.loads(response.content[0].text))The Full Pipeline
Here is the complete agentic RAG pipeline, wiring all five components together:
async def agentic_rag(
query: str,
vector_store,
bm25_index,
max_retrieval_attempts: int = 3
) -> dict:
"""Full agentic RAG pipeline with routing, judging, and grounding."""
# Step 1: Route
complexity = await route_query(query)
if complexity == "simple":
# Fast path: retrieve and generate directly
chunks = await hybrid_retrieve(query, vector_store, bm25_index, top_k=5)
answer = await generate_answer(query, chunks)
return {"answer": answer, "sources": [c.source for c in chunks], "path": "simple"}
# Step 2: Decompose complex queries
sub_queries = await decompose_query(query)
# Step 3: Retrieve for each sub-query
all_chunks: list[RetrievedChunk] = []
for sub_q in sub_queries:
chunks = await hybrid_retrieve(sub_q, vector_store, bm25_index, top_k=5)
all_chunks.extend(chunks)
# Deduplicate by content
seen = set()
unique_chunks = []
for chunk in all_chunks:
if chunk.content not in seen:
seen.add(chunk.content)
unique_chunks.append(chunk)
# Step 4: Judge retrieval quality (with retry loop)
current_query = query
for attempt in range(max_retrieval_attempts):
relevant_chunks, judgment = await judge_retrieval(current_query, unique_chunks)
if judgment.is_sufficient:
break
if judgment.suggested_requery and attempt < max_retrieval_attempts - 1:
# Re-retrieve with the judge's suggested query
new_chunks = await hybrid_retrieve(
judgment.suggested_requery, vector_store, bm25_index, top_k=5
)
unique_chunks.extend(new_chunks)
current_query = judgment.suggested_requery
else:
# Exhausted retries — generate with best available context
relevant_chunks = unique_chunks[:10]
# Step 5: Generate answer
answer = await generate_answer(query, relevant_chunks)
# Step 6: Grounding check
grounding = await check_grounding(answer, relevant_chunks, query)
if not grounding.is_grounded and grounding.unsupported_claims:
# Regenerate with explicit grounding instruction
answer = await generate_answer(
query,
relevant_chunks,
system_suffix=(
"\n\nIMPORTANT: Only state facts directly supported by the "
"provided context. Do not infer or extrapolate. If the context "
"does not contain enough information, say so explicitly."
)
)
return {
"answer": answer,
"sources": list(set(c.source for c in relevant_chunks)),
"grounding_confidence": grounding.confidence,
"retrieval_attempts": attempt + 1,
"path": "complex"
}
async def generate_answer(
query: str,
chunks: list[RetrievedChunk],
system_suffix: str = ""
) -> str:
"""Generate an answer from retrieved chunks."""
context = "\n\n".join(
f"[Source: {c.source}]\n{c.content}" for c in chunks
)
response = client.messages.create(
model="claude-sonnet-4-6-20250514",
max_tokens=4096,
system=(
"Answer the user's question based only on the provided context. "
"Cite sources when possible. If the context doesn't contain "
"enough information, say so." + system_suffix
),
messages=[{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {query}"
}]
)
return response.content[0].textMeasuring the Difference
The whole point of agentic RAG is measurable improvement. Here is how to build a retrieval evaluation set and track the metrics that matter.
@dataclass
class EvalCase:
query: str
expected_sources: list[str] # doc IDs that should be retrieved
expected_answer_contains: list[str] # key facts the answer must include
async def evaluate_pipeline(
eval_set: list[EvalCase],
pipeline_fn,
vector_store,
bm25_index
) -> dict:
"""Evaluate a RAG pipeline against a ground-truth eval set."""
metrics = {
"retrieval_recall": [], # did we find the right docs?
"answer_completeness": [], # did the answer include key facts?
"hallucination_rate": [], # did the answer include wrong facts?
"latency_ms": [],
"cost_usd": [],
}
for case in eval_set:
start = time.time()
result = await pipeline_fn(case.query, vector_store, bm25_index)
elapsed = (time.time() - start) * 1000
# Retrieval recall
retrieved_sources = set(result["sources"])
expected = set(case.expected_sources)
recall = len(retrieved_sources & expected) / len(expected) if expected else 1.0
metrics["retrieval_recall"].append(recall)
# Answer completeness
answer_lower = result["answer"].lower()
found = sum(1 for fact in case.expected_answer_contains if fact.lower() in answer_lower)
completeness = found / len(case.expected_answer_contains) if case.expected_answer_contains else 1.0
metrics["answer_completeness"].append(completeness)
metrics["latency_ms"].append(elapsed)
return {k: sum(v) / len(v) for k, v in metrics.items() if v}In production systems we have measured, agentic RAG with judging and retry improves retrieval recall from ~55% to ~82% and reduces hallucination rates by 3-4x compared to naive RAG on the same corpus.
The Cost Trade-Off
Agentic RAG uses 3-5x more LLM calls than naive RAG per query. On a simple query that routes through the fast path, the overhead is a single Haiku classification call (~$0.0001). On a complex query with decomposition, judging, and grounding, total cost is $0.02-0.08 per query. For most enterprise use cases, this is a rounding error compared to the cost of a wrong answer reaching a customer.
When to Use What
Not every application needs the full agentic pipeline. Here is the decision framework:
| Scenario | Architecture | Why |
|---|---|---|
| Internal search over well-structured docs | Naive RAG + reranker | Docs are clean, stakes are low, latency matters |
| Customer-facing Q&A | Agentic RAG (full pipeline) | Wrong answers erode trust, stakes are high |
| Code generation from docs | Agentic RAG + code validation | Generated code must compile and be correct |
| Multi-document synthesis | Agentic RAG + decomposition | Single retrieval cannot span multiple sources |
| Chat over a single PDF | Long-context model, no RAG | The whole document fits in context |
The last row is important. With 200K+ context windows now standard, many use cases that previously required RAG no longer do. If your entire corpus fits in context, RAG adds complexity without adding value.
The Eval Set Is the Product
Here is the uncomfortable truth: your RAG pipeline is only as good as your evaluation set. If you do not have a curated set of queries with known-good answers and expected source documents, you are flying blind.
Build your eval set from:
- Support tickets. Real user queries with verified answers from your support team.
- Edge cases. Queries that previously returned wrong answers.
- Adversarial queries. Queries designed to confuse retrieval (negations, comparisons, time-sensitive questions).
- Multi-hop queries. Questions that require combining information from multiple documents.
Start with 50 eval cases. Grow to 200. Run evals on every pipeline change. This is not optional — this is how you move from "it seems to work" to "it works."
The shift from naive to agentic RAG mirrors a broader pattern in AI engineering: moving from open-loop to closed-loop systems. Naive RAG is a feed-forward pipeline. Agentic RAG is a control system with sensors (the judge), actuators (re-retrieval), and a feedback loop (grounding checks). The same pattern applies to AI code generation, AI data analysis, and any LLM-powered workflow where correctness matters.
Stop Praying, Start Judging
Naive RAG was a reasonable starting point in 2023. In 2026, it is malpractice for any application where answer quality matters. The components described here — routing, decomposition, hybrid retrieval, judging, and grounding — are not research prototypes. They are production patterns running at scale across enterprise deployments today.
The cost of a wrong answer is almost always higher than the cost of three extra LLM calls. Build the judge. Close the loop. Stop praying.
Sources & Further Reading
- Cormack, Clarke, Buettcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" (SIGIR 2009)
- Anthropic Claude API: Structured Output
- LlamaIndex: Agentic RAG
- Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey" (2024)
- Pinecone: Hybrid Search with BM25
Was this article helpful?
Related Posts
AI Agents Keep Dying in Production. The Fix Was Invented in 1986.
Your agent framework handles the happy path. Erlang's supervision trees handled telecom uptime for 40 years. Here's how to apply the same 'let it crash' philosophy to make AI agents self-healing.
Read moreMCP, Agents, Skills, Subagents: The Definitive Guide to AI's New Building Blocks
Everyone's building with agents, MCP servers, skills, and subagents. Almost nobody can explain when to use which. This is the guide that fixes that — with architecture diagrams, production code, and a decision framework you can apply today.
Read moreDeerFlow 2.0: ByteDance Open-Sourced a Full-Stack SuperAgent. Here's the Complete Developer Guide.
ByteDance's DeerFlow 2.0 hit #1 on GitHub Trending with 39K stars in weeks. It's not another chatbot wrapper — it's a full-stack SuperAgent harness with sandboxed execution, persistent memory, sub-agents, and LangGraph orchestration. Here's everything you need to build with it.
Read moreComments
No comments yet. Be the first to share your thoughts!