
Tutorials

Naive RAG Is Dead. Here's What Replaced It.

Most RAG pipelines retrieve garbage, stuff it into context, and pray. Agentic RAG replaces the prayer with a judge, a retry loop, and a routing layer that actually works.

AIStackInsights Team · April 9, 2026 · 14 min read
Tags: rag · ai-agents · retrieval · python · architecture · llms

You built a RAG pipeline. You chunked your documents, embedded them with text-embedding-3-large, stored them in Pinecone, and wired it all into your chatbot. The demo looked great. Your stakeholders were impressed.

Then real users showed up.

"What's our refund policy for enterprise contracts?" Your pipeline retrieves three chunks about consumer refunds, one chunk about enterprise pricing, and zero chunks about enterprise refund policies. The LLM, ever helpful, synthesizes a confident answer from the wrong context. Your user follows it. Legal calls you.

This is not a chunking problem. It is not an embedding model problem. It is an architecture problem. Naive RAG — retrieve once, generate once — is fundamentally broken for anything beyond toy demos. The industry has moved on. Here's what replaced it.

The Three Failures of Naive RAG

Every naive RAG pipeline fails in the same three ways:

1. Retrieval Misses

Embedding similarity is not semantic relevance. The query "enterprise contract refund policy" and the chunk "Enterprise agreements are subject to the terms outlined in Appendix B of the Master Service Agreement" have low cosine similarity despite being exactly what the user needs. The chunk uses none of the query's keywords. The embedding model has never seen your company's jargon.
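To make the mismatch concrete, here is how cosine similarity is computed, with toy 4-dimensional vectors standing in for real embeddings (the numbers are invented for illustration — real embedding vectors have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors: the query and the chunk share almost no vocabulary,
# so their (made-up) embeddings point in different directions.
query_vec = [0.9, 0.1, 0.0, 0.1]   # "enterprise contract refund policy"
chunk_vec = [0.1, 0.2, 0.9, 0.3]   # "...Appendix B of the Master Service Agreement"

print(round(cosine_similarity(query_vec, chunk_vec), 3))  # → 0.158
```

A similarity that low typically falls outside the top-k cutoff, so the one chunk that actually answers the question never reaches the context window.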

2. Context Poisoning

Even when the right chunk is retrieved, it often arrives alongside irrelevant chunks that dilute or contradict it. The LLM cannot reliably distinguish "this chunk answers the question" from "this chunk is topically related but irrelevant." When you stuff five chunks into context and only one is relevant, you are asking the model to find a needle in a haystack you constructed.

3. No Feedback Loop

Naive RAG is open-loop. Retrieve, generate, return. There is no mechanism to detect that the retrieval was bad, that the generation hallucinated, or that the answer does not actually address the question. Every other engineering discipline has feedback loops. Naive RAG has a prayer.

The Retrieval Accuracy Problem

In benchmarks, top-5 retrieval accuracy for real-world enterprise corpora hovers between 40% and 65%. That means 35-60% of the time, the correct answer is not in the context window at all. No amount of prompt engineering fixes missing context.
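"Top-5 accuracy" here is just recall over the gold documents within the first k results. A minimal sketch of the metric, with hypothetical doc IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant) if relevant else 1.0

# Hypothetical example: only 1 of the 2 gold docs surfaced in the top 5.
print(recall_at_k(["d3", "d9", "d1", "d7", "d4"], {"d1", "d2"}))  # → 0.5
```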

Agentic RAG: The Architecture

Agentic RAG replaces the linear retrieve-generate pipeline with an agent loop that can reason about its own retrieval quality and take corrective action.

         ┌───────────────┐
         │  User Query   │
         └───────┬───────┘
                 │
         ┌───────▼───────┐
         │ Query Router  │──── Simple lookup? ──► Direct retrieval
         └───────┬───────┘
                 │ Complex query
         ┌───────▼───────────┐
         │ Query Decomposer  │
         │ (break into       │
         │  sub-queries)     │
         └───────┬───────────┘
                 │
         ┌───────▼───────┐
         │ Retriever     │◄──── Retry with
         │ (hybrid       │      rewritten query
         │  search)      │           ▲
         └───────┬───────┘           │
                 │                   │
         ┌───────▼───────┐           │
         │ Retrieval     │─── Bad ───┘
         │ Judge         │
         └───────┬───────┘
                 │ Good
         ┌───────▼───────┐
         │ Generator     │
         └───────┬───────┘
                 │
         ┌───────▼───────┐
         │ Answer        │── Unsupported ──► Retry generation
         │ Grounding     │                   or re-retrieve
         │ Check         │
         └───────┬───────┘
                 │ Grounded
         ┌───────▼───────┐
         │ Response      │
         └───────────────┘

Five components. Each one addresses a specific failure mode of naive RAG. Let's build them.

Component 1: Query Router

Not every query needs the full agentic pipeline. Simple factual lookups can go through fast retrieval. Complex, multi-part, or ambiguous queries need decomposition and judging.

import anthropic
import json
 
# Synchronous client for brevity. For real concurrency, use
# anthropic.AsyncAnthropic() and await each messages.create() call.
client = anthropic.Anthropic()
 
async def route_query(query: str) -> str:
    """Classify a query as simple or complex."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": (
                f"Classify this query as SIMPLE or COMPLEX.\n"
                f"SIMPLE: single fact, direct lookup, unambiguous.\n"
                f"COMPLEX: multi-part, requires reasoning across documents, "
                f"ambiguous, or comparative.\n\n"
                f"Query: {query}\n\n"
                f"Respond with only SIMPLE or COMPLEX."
            )
        }]
    )
    classification = response.content[0].text.strip().upper()
    return "simple" if "SIMPLE" in classification else "complex"

This costs a fraction of a cent per query and reserves the full pipeline for the queries that need it. In production, 40-60% of queries are simple lookups that need neither decomposition nor judging.

Component 2: Query Decomposer

Complex queries need to be broken into sub-queries that each target a specific piece of information. This is where agentic RAG diverges most from naive RAG.

async def decompose_query(query: str) -> list[str]:
    """Break a complex query into retrieval-optimized sub-queries."""
    response = client.messages.create(
        model="claude-sonnet-4-6-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Break this query into 2-4 independent sub-queries, each "
                f"targeting a specific piece of information needed to answer "
                f"the original query. Each sub-query should be self-contained "
                f"and optimized for retrieval from a document store.\n\n"
                f"Query: {query}\n\n"
                f"Return a JSON array of strings. No explanation."
            )
        }]
    )
    return json.loads(response.content[0].text)

For example, "How does our enterprise refund policy compare to the consumer policy, and when was each last updated?" becomes:

  1. "Enterprise contract refund policy terms and conditions"
  2. "Consumer refund policy terms and conditions"
  3. "Enterprise refund policy last updated date"
  4. "Consumer refund policy last updated date"

Each sub-query retrieves from a tighter semantic neighborhood than the original, compound query.

Component 3: Hybrid Retrieval

Embeddings alone miss keyword-critical matches. BM25 alone misses semantic matches. Use both.

from dataclasses import dataclass
 
@dataclass
class RetrievedChunk:
    content: str
    source: str
    score: float
    method: str  # "semantic", "keyword", or "both"
 
 
async def hybrid_retrieve(
    query: str,
    vector_store,
    bm25_index,
    top_k: int = 10
) -> list[RetrievedChunk]:
    """Retrieve using both semantic and keyword search, then fuse."""
    # Semantic search via embeddings
    semantic_results = await vector_store.query(
        query=query,
        top_k=top_k,
        include_metadata=True
    )
 
    # Keyword search via BM25
    keyword_results = bm25_index.search(query, top_k=top_k)
 
    # Reciprocal Rank Fusion (RRF)
    scores: dict[str, float] = {}
    chunks: dict[str, RetrievedChunk] = {}
 
    for rank, result in enumerate(semantic_results):
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (60 + rank)
        chunks[doc_id] = RetrievedChunk(
            content=result["text"],
            source=result["metadata"]["source"],
            score=0,  # will be updated
            method="semantic"
        )
 
    for rank, result in enumerate(keyword_results):
        doc_id = result.doc_id
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (60 + rank)
        if doc_id not in chunks:
            chunks[doc_id] = RetrievedChunk(
                content=result.text,
                source=result.source,
                score=0,
                method="keyword"
            )
        else:
            chunks[doc_id].method = "both"
 
    # Sort by fused score
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    result = []
    for doc_id, score in ranked[:top_k]:
        chunk = chunks[doc_id]
        chunk.score = score
        result.append(chunk)
 
    return result

Reciprocal Rank Fusion (RRF) is dead simple: for each retrieval method, assign each result a score of 1 / (k + rank) where k=60 is a constant. Sum the scores across methods. Sort. Documents that appear in both lists get boosted; documents that only one method found still appear.

Why k=60?

The constant k=60 comes from the original RRF paper (Cormack et al., 2009). It controls how much lower-ranked results are penalized. k=60 is robust across most datasets. Do not tune it unless you have a retrieval evaluation set — which you should.
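Stripped of the vector-store plumbing, the fusion step is a few lines. A minimal sketch over two toy ranked ID lists, using the same 1 / (k + rank) scoring as the retriever above:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse multiple ranked lists: each doc's score is the sum of 1/(k + rank)
    across every list it appears in (rank is 0-based here, as in the retriever)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

semantic = ["a", "b", "c"]  # toy doc IDs
keyword  = ["c", "d", "a"]
fused = rrf_fuse([semantic, keyword])
print([doc for doc, _ in fused])  # → ['a', 'c', 'b', 'd']
```

Note how "a" and "c" — each found by both methods — outrank "b" and "d", which only one method surfaced. That is the whole trick.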

Component 4: Retrieval Judge

This is the critical component that naive RAG lacks entirely. After retrieval, an LLM evaluates whether the retrieved chunks actually contain the information needed to answer the query.

@dataclass
class JudgmentResult:
    is_sufficient: bool
    relevant_chunks: list[int]  # indices of relevant chunks
    missing_info: str | None    # what's missing, if insufficient
    suggested_requery: str | None  # rewritten query to try
 
 
async def judge_retrieval(
    query: str,
    chunks: list[RetrievedChunk]
) -> tuple[list[RetrievedChunk], JudgmentResult]:
    """Judge whether retrieved chunks can answer the query."""
    chunks_text = "\n\n".join(
        f"[Chunk {i}] (source: {c.source})\n{c.content}"
        for i, c in enumerate(chunks)
    )
 
    response = client.messages.create(
        model="claude-sonnet-4-6-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"You are a retrieval quality judge. Given a query and "
                f"retrieved chunks, determine:\n"
                f"1. Which chunks are relevant (by index)\n"
                f"2. Whether the relevant chunks contain sufficient "
                f"information to answer the query\n"
                f"3. If insufficient, what information is missing\n"
                f"4. If insufficient, suggest a rewritten query\n\n"
                f"Query: {query}\n\n"
                f"Retrieved chunks:\n{chunks_text}\n\n"
                f"Respond in JSON:\n"
                f'{{"is_sufficient": bool, "relevant_chunks": [int], '
                f'"missing_info": str|null, "suggested_requery": str|null}}'
            )
        }]
    )
 
    judgment = JudgmentResult(**json.loads(response.content[0].text))
 
    # Filter to only relevant chunks
    relevant = [chunks[i] for i in judgment.relevant_chunks if i < len(chunks)]
 
    return relevant, judgment

The judge serves two purposes:

  1. Filtering. It removes irrelevant chunks from context before generation, eliminating context poisoning.
  2. Retry signal. When retrieval is insufficient, it provides a rewritten query — a query specifically designed to find what was missing.

This turns retrieval from open-loop to closed-loop. The system can detect and recover from bad retrieval.

Component 5: Answer Grounding Check

Even with good retrieval, the generator can hallucinate. The grounding check verifies that every claim in the generated answer is supported by the retrieved chunks.

@dataclass
class GroundingResult:
    is_grounded: bool
    unsupported_claims: list[str]
    confidence: float  # 0-1
 
 
async def check_grounding(
    answer: str,
    chunks: list[RetrievedChunk],
    query: str
) -> GroundingResult:
    """Verify that the answer is grounded in the retrieved chunks."""
    context = "\n\n".join(c.content for c in chunks)
 
    response = client.messages.create(
        model="claude-sonnet-4-6-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"You are a grounding checker. Verify that every factual "
                f"claim in the answer is directly supported by the provided "
                f"context. Flag any claims that are not supported.\n\n"
                f"Question: {query}\n\n"
                f"Context:\n{context}\n\n"
                f"Answer:\n{answer}\n\n"
                f"Respond in JSON:\n"
                f'{{"is_grounded": bool, "unsupported_claims": [str], '
                f'"confidence": float}}'
            )
        }]
    )
 
    return GroundingResult(**json.loads(response.content[0].text))

The Full Pipeline

Here is the complete agentic RAG pipeline, wiring all five components together:

async def agentic_rag(
    query: str,
    vector_store,
    bm25_index,
    max_retrieval_attempts: int = 3
) -> dict:
    """Full agentic RAG pipeline with routing, judging, and grounding."""
 
    # Step 1: Route
    complexity = await route_query(query)
 
    if complexity == "simple":
        # Fast path: retrieve and generate directly
        chunks = await hybrid_retrieve(query, vector_store, bm25_index, top_k=5)
        answer = await generate_answer(query, chunks)
        return {"answer": answer, "sources": [c.source for c in chunks], "path": "simple"}
 
    # Step 2: Decompose complex queries
    sub_queries = await decompose_query(query)
 
    # Step 3: Retrieve for each sub-query
    all_chunks: list[RetrievedChunk] = []
    for sub_q in sub_queries:
        chunks = await hybrid_retrieve(sub_q, vector_store, bm25_index, top_k=5)
        all_chunks.extend(chunks)
 
    # Deduplicate by content
    seen = set()
    unique_chunks = []
    for chunk in all_chunks:
        if chunk.content not in seen:
            seen.add(chunk.content)
            unique_chunks.append(chunk)
 
    # Step 4: Judge retrieval quality (with retry loop)
    current_query = query
    for attempt in range(max_retrieval_attempts):
        relevant_chunks, judgment = await judge_retrieval(current_query, unique_chunks)
 
        if judgment.is_sufficient:
            break
 
        if judgment.suggested_requery and attempt < max_retrieval_attempts - 1:
            # Re-retrieve with the judge's suggested query
            new_chunks = await hybrid_retrieve(
                judgment.suggested_requery, vector_store, bm25_index, top_k=5
            )
            unique_chunks.extend(new_chunks)
            current_query = judgment.suggested_requery
    else:
        # Exhausted retries — generate with best available context
        relevant_chunks = unique_chunks[:10]
 
    # Step 5: Generate answer
    answer = await generate_answer(query, relevant_chunks)
 
    # Step 6: Grounding check
    grounding = await check_grounding(answer, relevant_chunks, query)
 
    if not grounding.is_grounded and grounding.unsupported_claims:
        # Regenerate with explicit grounding instruction
        answer = await generate_answer(
            query,
            relevant_chunks,
            system_suffix=(
                "\n\nIMPORTANT: Only state facts directly supported by the "
                "provided context. Do not infer or extrapolate. If the context "
                "does not contain enough information, say so explicitly."
            )
        )
 
    return {
        "answer": answer,
        "sources": list(set(c.source for c in relevant_chunks)),
        "grounding_confidence": grounding.confidence,
        "retrieval_attempts": attempt + 1,
        "path": "complex"
    }
 
 
async def generate_answer(
    query: str,
    chunks: list[RetrievedChunk],
    system_suffix: str = ""
) -> str:
    """Generate an answer from retrieved chunks."""
    context = "\n\n".join(
        f"[Source: {c.source}]\n{c.content}" for c in chunks
    )
    response = client.messages.create(
        model="claude-sonnet-4-6-20250514",
        max_tokens=4096,
        system=(
            "Answer the user's question based only on the provided context. "
            "Cite sources when possible. If the context doesn't contain "
            "enough information, say so." + system_suffix
        ),
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text

Measuring the Difference

The whole point of agentic RAG is measurable improvement. Here is how to build a retrieval evaluation set and track the metrics that matter.

import time

@dataclass
class EvalCase:
    query: str
    expected_sources: list[str]  # doc IDs that should be retrieved
    expected_answer_contains: list[str]  # key facts the answer must include
 
 
async def evaluate_pipeline(
    eval_set: list[EvalCase],
    pipeline_fn,
    vector_store,
    bm25_index
) -> dict:
    """Evaluate a RAG pipeline against a ground-truth eval set."""
    metrics = {
        "retrieval_recall": [],      # did we find the right docs?
        "answer_completeness": [],   # did the answer include key facts?
        "hallucination_rate": [],    # did the answer include wrong facts?
        "latency_ms": [],
        "cost_usd": [],
    }
 
    for case in eval_set:
        start = time.time()
        result = await pipeline_fn(case.query, vector_store, bm25_index)
        elapsed = (time.time() - start) * 1000
 
        # Retrieval recall
        retrieved_sources = set(result["sources"])
        expected = set(case.expected_sources)
        recall = len(retrieved_sources & expected) / len(expected) if expected else 1.0
        metrics["retrieval_recall"].append(recall)
 
        # Answer completeness
        answer_lower = result["answer"].lower()
        found = sum(1 for fact in case.expected_answer_contains if fact.lower() in answer_lower)
        completeness = found / len(case.expected_answer_contains) if case.expected_answer_contains else 1.0
        metrics["answer_completeness"].append(completeness)
 
        metrics["latency_ms"].append(elapsed)
 
    return {k: sum(v) / len(v) for k, v in metrics.items() if v}

In production systems we have measured, agentic RAG with judging and retry improves retrieval recall from ~55% to ~82% and reduces hallucination rates by 3-4x compared to naive RAG on the same corpus.

The Cost Trade-Off

Agentic RAG uses 3-5x more LLM calls than naive RAG per query. On a simple query that routes through the fast path, the overhead is a single Haiku classification call (~$0.0001). On a complex query with decomposition, judging, and grounding, total cost is $0.02-0.08 per query. For most enterprise use cases, this is a rounding error compared to the cost of a wrong answer reaching a customer.
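A back-of-the-envelope blended-cost model makes the trade-off easy to reason about. The per-query prices below are illustrative assumptions taken from the ranges above, not measured numbers:

```python
def blended_cost_per_query(
    simple_fraction: float,
    simple_cost: float = 0.0001,  # assumed: one Haiku routing call + fast path
    complex_cost: float = 0.05,   # assumed: midpoint of the $0.02-0.08 range
) -> float:
    """Expected cost per query given the fraction routed to the fast path."""
    return simple_fraction * simple_cost + (1 - simple_fraction) * complex_cost

# If half of traffic takes the fast path, the blend lands around 2.5 cents:
print(blended_cost_per_query(0.5))
```

At a million queries a month, a 50/50 split under these assumptions is roughly $25K — which is why the router matters: every query it diverts to the fast path cuts cost by ~500x.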

When to Use What

Not every application needs the full agentic pipeline. Here is the decision framework:

Scenario                                  | Architecture                  | Why
------------------------------------------|-------------------------------|-----------------------------------------------
Internal search over well-structured docs | Naive RAG + reranker          | Docs are clean, stakes are low, latency matters
Customer-facing Q&A                       | Agentic RAG (full pipeline)   | Wrong answers erode trust, stakes are high
Code generation from docs                 | Agentic RAG + code validation | Generated code must compile and be correct
Multi-document synthesis                  | Agentic RAG + decomposition   | Single retrieval cannot span multiple sources
Chat over a single PDF                    | Long-context model, no RAG    | The whole document fits in context

The last row is important. With 200K+ context windows now standard, many use cases that previously required RAG no longer do. If your entire corpus fits in context, RAG adds complexity without adding value.
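A rough way to make that call programmatically, assuming the common ~4 characters-per-token heuristic (an approximation — use your model's real tokenizer for anything close to the limit):

```python
def fits_in_context(
    corpus_texts: list[str],
    context_window: int = 200_000,  # tokens
    reserve: int = 20_000,          # headroom for system prompt, question, answer
) -> bool:
    """Estimate whether the whole corpus fits in one context window.

    Uses the rough heuristic of ~4 characters per token.
    """
    est_tokens = sum(len(t) for t in corpus_texts) // 4
    return est_tokens <= context_window - reserve

# A single 300-page PDF (~600K chars ≈ 150K tokens) fits; RAG is overkill.
print(fits_in_context(["x" * 600_000]))  # → True
```

If this returns True for your corpus, skip everything in this article and just put the documents in context.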

The Eval Set Is the Product

Here is the uncomfortable truth: your RAG pipeline is only as good as your evaluation set. If you do not have a curated set of queries with known-good answers and expected source documents, you are flying blind.

Build your eval set from:

  1. Support tickets. Real user queries with verified answers from your support team.
  2. Edge cases. Queries that previously returned wrong answers.
  3. Adversarial queries. Queries designed to confuse retrieval (negations, comparisons, time-sensitive questions).
  4. Multi-hop queries. Questions that require combining information from multiple documents.

Start with 50 eval cases. Grow to 200. Run evals on every pipeline change. This is not optional — this is how you move from "it seems to work" to "it works."
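Seeding the set is mechanical once you fix the shape. A sketch using the EvalCase shape from the evaluation section, redefined here so the snippet is self-contained — the tags field is an addition of ours for slicing results by category, and every query, doc ID, and expected fact below is invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    query: str
    expected_sources: list[str]           # doc IDs that should be retrieved
    expected_answer_contains: list[str]   # key facts the answer must include
    tags: list[str] = field(default_factory=list)  # e.g. "adversarial"

EVAL_SET = [
    # From a support ticket with a verified answer
    EvalCase(
        query="What's the refund window for enterprise contracts?",
        expected_sources=["msa-appendix-b"],
        expected_answer_contains=["30 days"],
        tags=["support-ticket"],
    ),
    # Adversarial: negation
    EvalCase(
        query="Which plans are NOT eligible for refunds?",
        expected_sources=["refund-policy"],
        expected_answer_contains=["trial"],
        tags=["adversarial"],
    ),
    # Multi-hop: spans two documents
    EvalCase(
        query="Compare enterprise and consumer refund windows.",
        expected_sources=["msa-appendix-b", "consumer-refunds"],
        expected_answer_contains=["enterprise", "consumer"],
        tags=["multi-hop"],
    ),
]

print(len(EVAL_SET), sorted({t for c in EVAL_SET for t in c.tags}))
```

Tagging each case by category lets you report recall per slice — a pipeline change that helps multi-hop queries but regresses negations should be visible, not averaged away.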

The shift from naive to agentic RAG mirrors a broader pattern in AI engineering: moving from open-loop to closed-loop systems. Naive RAG is a feed-forward pipeline. Agentic RAG is a control system with sensors (the judge), actuators (re-retrieval), and a feedback loop (grounding checks). The same pattern applies to AI code generation, AI data analysis, and any LLM-powered workflow where correctness matters.

Stop Praying, Start Judging

Naive RAG was a reasonable starting point in 2023. In 2026, it is malpractice for any application where answer quality matters. The components described here — routing, decomposition, hybrid retrieval, judging, and grounding — are not research prototypes. They are production patterns running at scale across enterprise deployments today.

The cost of a wrong answer is almost always higher than the cost of three extra LLM calls. Build the judge. Close the loop. Stop praying.


Sources & Further Reading

  1. Cormack, Clarke, Buettcher, "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods" (SIGIR 2009)
  2. Anthropic Claude API: Structured Output
  3. LlamaIndex: Agentic RAG
  4. Gao et al., "Retrieval-Augmented Generation for Large Language Models: A Survey" (2024)
  5. Pinecone: Hybrid Search with BM25
