Multi-Agent AI Systems Are Eating Single Agents. Here's How to Build One That Works.
Single-agent architectures hit a wall the moment your task needs planning, research, and execution in parallel. Multi-agent systems solve this — but most tutorials skip the hard parts. This guide doesn't.
You built a single AI agent. It has tools. It has a system prompt. It reasons through problems step by step. It works great — until you ask it to do two things at once.
"Research competitor pricing, then write a report, then fact-check the report against our internal data." Your single agent tries to do all three sequentially. It burns through your context window by step two. By step three, it has forgotten half of what it researched. The report contradicts itself. Your stakeholder reads it, politely says "this isn't quite right," and goes back to doing it manually.
This is not a prompt engineering problem. It is not a model capability problem. It is an architecture problem. You gave one agent three jobs that require three different skill sets, three different tool configurations, and three different evaluation criteria. No single system prompt can hold all of that without tradeoffs.
The industry figured this out. Gartner reported a 1,445% surge in multi-agent system inquiries between Q1 2024 and Q2 2025. LangGraph crossed 126,000 GitHub stars. CrewAI hit 44,600. Every serious AI engineering team is moving from "one big agent" to "a team of specialized agents that coordinate."
But most tutorials show you the happy path: three agents in a loop, no error handling, no state management, no production concerns. This guide covers the real architecture — including the parts that break.
Why Single Agents Hit a Ceiling
Single agents fail at compound tasks for three structural reasons:
1. Context Window Contamination
A single agent doing research, analysis, and writing accumulates context from every step. By the time it reaches the writing phase, its context window contains raw search results, intermediate reasoning, failed tool calls, and correction attempts. The signal-to-noise ratio collapses. The model cannot distinguish between "information I gathered" and "information I should use."
2. Tool Configuration Conflicts
A research agent needs web search, document retrieval, and API access. A writing agent needs a style guide, templates, and formatting tools. A fact-checker needs source verification and citation tools. When you give all tools to one agent, it makes poor tool selection decisions. It will use web search when it should use internal retrieval. It will format when it should still be researching.
3. No Specialization, No Evaluation
A single agent cannot evaluate its own output against domain-specific criteria because it is using the same context for generation and evaluation. A dedicated fact-checker agent, with a fresh context window and a focused system prompt, catches errors that the original agent literally cannot see — because those errors are part of the context that generated them.
The Context Window Tax
In production benchmarks, single agents performing 3+ step compound tasks show a 35-50% quality degradation by the final step compared to the same model performing that step in isolation. This is not a model limitation — it is a context management failure that multi-agent architectures solve by giving each agent a clean slate.
The Three Multi-Agent Architectures
Not all multi-agent systems are the same. There are three dominant patterns, each suited to different problem structures.
Sequential Pipeline
Agents execute in a fixed order. Agent A's output becomes Agent B's input. Best for workflows with clear stages: research → draft → review → publish.
When to use: The task has a natural order. Each stage has clear input/output contracts. You need predictable execution time.
When it breaks: If any agent in the chain fails, the whole run fails — there is no fallback. Earlier agents cannot benefit from later agents' feedback without re-running the entire pipeline.
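The pattern can be sketched in a few lines. This is a minimal illustration, not a framework API: each stage here is a plain function, standing in for an LLM call with its own system prompt, and the stage names are hypothetical.

```python
from typing import Callable

# A stage maps text to text; in a real system each stage would wrap
# an LLM call with its own system prompt and tool set.
Stage = Callable[[str], str]

def run_pipeline(stages: list[Stage], task: str) -> str:
    """Run stages in fixed order; each output feeds the next input."""
    result = task
    for stage in stages:
        result = stage(result)
    return result

# Hypothetical stand-ins for research -> draft -> review
research = lambda q: f"findings for: {q}"
draft = lambda f: f"draft based on ({f})"
review = lambda d: f"approved: {d}"

report = run_pipeline([research, draft, review], "pricing analysis")
```

Note the brittleness is visible even here: an exception in any stage propagates up and kills the whole run.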
Hierarchical Delegation
A supervisor agent receives the task, decomposes it, delegates subtasks to specialized worker agents, and synthesizes their outputs. This is the most common production pattern.
When to use: Tasks are decomposable. Workers have different tool sets. You need dynamic routing — the supervisor can assign different workers based on the input.
When it breaks: The supervisor becomes a bottleneck. If it misunderstands the task, it delegates incorrectly and every worker produces irrelevant output. Supervisor quality is the ceiling for the entire system.
Collaborative Network
Agents communicate peer-to-peer, critiquing and refining each other's work. Often used for adversarial validation — one agent generates, another attacks, a third resolves.
When to use: Quality matters more than speed. The task benefits from multiple perspectives. You need adversarial checking (legal, compliance, safety).
When it breaks: Without termination conditions, agents can loop forever. Three agents politely disagreeing with each other at $0.01/turn adds up fast.
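The fix for infinite loops is a hard turn cap. A sketch of the generate/critique loop, assuming a `critique` callable that returns `None` when satisfied (both callables are hypothetical stand-ins for agent calls):

```python
from typing import Callable, Optional

def debate(
    generate: Callable[[str], str],
    critique: Callable[[str], Optional[str]],
    max_turns: int = 4,
) -> str:
    """Generate/critique loop with a hard turn cap so peers cannot
    disagree forever. `critique` returns None when satisfied,
    otherwise feedback text for the next generation."""
    draft = generate("")
    for _ in range(max_turns):
        feedback = critique(draft)
        if feedback is None:
            return draft  # critic approved
        draft = generate(feedback)
    return draft  # accept-and-warn: cap reached, ship the last draft
```

The `max_turns` default is arbitrary; the point is that *some* cap exists and the exit path on cap exhaustion is explicit.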
Framework Decision Matrix
Before writing code, you need to pick a framework. Here is the honest comparison based on production use — not marketing pages.
| Criterion | LangGraph | CrewAI | OpenAI Agents SDK | Claude Agent SDK |
|---|---|---|---|---|
| Architecture | Directed graph with state machines | Role-based crews with task delegation | Imperative handoff chains | Tool-use chain with sub-agents |
| Model lock-in | None (any LLM) | None (any LLM) | OpenAI only | Claude only |
| State management | Built-in checkpointing with time-travel | Sequential task output passing | Ephemeral context variables | Conversation-scoped |
| Streaming | Per-node token streaming | Limited | Full streaming | Full streaming |
| Human-in-the-loop | First-class support with breakpoints | Basic callback support | Via guardrails | Via tool approval |
| Observability | LangSmith integration | Basic logging | Built-in tracing | Built-in tracing |
| Learning curve | Steep (1-2 weeks) | Gentle (hours) | Minimal | Minimal |
| Best for | Complex stateful workflows | Fast prototyping, role-based teams | Quick single-model agents | Code-centric tasks |
Framework Lock-in Is Real
Choosing a model-locked SDK (OpenAI Agents SDK, Claude Agent SDK) trades flexibility for simplicity. This works until the model provider raises prices, has an outage, or a competitor releases a better model for your use case. Model-agnostic frameworks (LangGraph, CrewAI) cost more setup time but let you swap models without rewriting your orchestration layer.
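One way to keep the orchestration layer provider-agnostic is to define the interface your nodes actually need and build nodes from a factory. This is a sketch of the idea, not any framework's API — `ChatModel` and `make_planner` are names invented here:

```python
from typing import Callable, Protocol

class ChatModel(Protocol):
    """The minimal interface the orchestration layer depends on.
    Any provider SDK can be wrapped in an adapter that satisfies it."""
    def complete(self, system: str, prompt: str) -> str: ...

def make_planner(model: ChatModel) -> Callable[[dict], dict]:
    """Node factory: the graph wiring never names a provider, so
    swapping models means swapping one adapter, not the graph."""
    def planner(state: dict) -> dict:
        text = model.complete(
            system="Decompose the query into sub-questions.",
            prompt=state["query"],
        )
        return {"plan": text.splitlines()}
    return planner
```

Under this shape, a pricing change or outage means writing one new adapter class, not touching node logic.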
Building a Multi-Agent Research System with LangGraph
Let's build something real: a research system where a Planner decomposes queries, a Researcher gathers information, a Writer produces the output, and a Reviewer validates quality. This is the hierarchical delegation pattern.
Step 1: Define the Shared State
Every agent in LangGraph reads from and writes to a shared state object. This is the contract between agents — get it wrong and agents will talk past each other.
from typing import TypedDict, Annotated, Literal
from langgraph.graph import StateGraph, END, START
from langgraph.graph.message import add_messages
class ResearchState(TypedDict):
"""Shared state passed between all agents in the graph."""
query: str # Original user query
plan: list[str] # Decomposed sub-tasks
research_results: Annotated[list, add_messages] # Accumulated findings
draft: str # Written output
review_feedback: str # Reviewer's assessment
review_passed: bool # Gate: did the draft pass?
revision_count: int # Safety valve for loops
    status: str  # Current pipeline stage

The Annotated[list, add_messages] pattern tells LangGraph to append new research results rather than overwrite them. Without this, each researcher call would erase the previous one's findings.
Step 2: Build the Agent Nodes
Each node is a Python function that takes the current state and returns a partial state update. This is where your LLM calls live.
from anthropic import Anthropic
client = Anthropic()
def planner_node(state: ResearchState) -> dict:
"""Decompose the query into research sub-tasks."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system="""You are a research planner. Given a query, decompose it
into 2-4 specific, searchable sub-questions. Return each
sub-question on its own line. Nothing else.""",
messages=[{"role": "user", "content": state["query"]}],
)
plan = [
line.strip()
for line in response.content[0].text.strip().split("\n")
if line.strip()
]
return {"plan": plan, "status": "planned"}
def researcher_node(state: ResearchState) -> dict:
"""Research each sub-task and accumulate findings."""
findings = []
for sub_task in state["plan"]:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=2048,
system="""You are a research analyst. Given a specific question,
provide a thorough, factual analysis with concrete data points.
Cite your reasoning. Be precise, not verbose.""",
messages=[{"role": "user", "content": sub_task}],
)
findings.append(f"## {sub_task}\n\n{response.content[0].text}")
return {
"research_results": findings,
"status": "researched",
}
def writer_node(state: ResearchState) -> dict:
"""Synthesize research into a coherent report."""
research_context = "\n\n---\n\n".join(state["research_results"])
feedback_note = ""
if state.get("review_feedback"):
feedback_note = (
f"\n\nPrevious review feedback to address:\n"
f"{state['review_feedback']}"
)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=4096,
system=f"""You are a technical writer. Synthesize the research below
into a clear, well-structured report that directly answers the
original query. Use specific data points from the research.
Do not invent facts.{feedback_note}""",
messages=[
{"role": "user", "content": (
f"Original query: {state['query']}\n\n"
f"Research findings:\n{research_context}"
)},
],
)
return {
"draft": response.content[0].text,
"status": "drafted",
}
def reviewer_node(state: ResearchState) -> dict:
"""Evaluate the draft for accuracy and completeness."""
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system="""You are a fact-checking reviewer. Evaluate this report
against the research findings. Check for:
1. Claims not supported by the research
2. Missing key findings from the research
3. Logical inconsistencies
Respond with APPROVED if the report passes all checks.
Otherwise, respond with REVISION NEEDED: followed by specific
feedback.""",
messages=[
{"role": "user", "content": (
f"Report:\n{state['draft']}\n\n"
f"Research findings:\n"
+ "\n---\n".join(state["research_results"])
)},
],
)
review_text = response.content[0].text.strip()
passed = review_text.upper().startswith("APPROVED")
return {
"review_feedback": review_text,
"review_passed": passed,
"revision_count": state.get("revision_count", 0) + 1,
"status": "approved" if passed else "needs_revision",
}Step 3: Wire the Graph with Conditional Edges
This is where LangGraph shines. The conditional edge after the reviewer creates a feedback loop — the writer gets another shot if the review fails, but only up to a limit.
def should_revise(state: ResearchState) -> Literal["writer", "end"]:
"""Route based on review outcome and revision count."""
if state["review_passed"]:
return "end"
if state["revision_count"] >= 3:
# Safety valve: accept the draft after 3 attempts
return "end"
return "writer"
# Assemble the graph
workflow = StateGraph(ResearchState)
# Add nodes
workflow.add_node("planner", planner_node)
workflow.add_node("researcher", researcher_node)
workflow.add_node("writer", writer_node)
workflow.add_node("reviewer", reviewer_node)
# Wire edges
workflow.add_edge(START, "planner")
workflow.add_edge("planner", "researcher")
workflow.add_edge("researcher", "writer")
workflow.add_edge("writer", "reviewer")
workflow.add_conditional_edges("reviewer", should_revise, {
"writer": "writer",
"end": END,
})
# Compile
app = workflow.compile()

Step 4: Run It
result = app.invoke({
"query": "What are the cost and latency tradeoffs of running "
"open-weight LLMs on-device vs. cloud API inference "
"for mobile applications in 2026?",
"plan": [],
"research_results": [],
"draft": "",
"review_feedback": "",
"review_passed": False,
"revision_count": 0,
"status": "started",
})
print(result["draft"])
print(f"\nRevisions required: {result['revision_count']}")
print(f"Final status: {result['status']}")

The execution flow runs planner → researcher → writer → reviewer, with the conditional edge looping back to the writer until the review passes or the three-revision cap is hit.
The Same System in CrewAI — Half the Code, Half the Control
If LangGraph is the manual transmission, CrewAI is the automatic. You define roles and tasks. The framework handles orchestration. Here is the same research pipeline:
from crewai import Agent, Task, Crew, Process
planner = Agent(
role="Research Planner",
goal="Decompose complex queries into specific research sub-questions",
backstory="You are a senior research strategist who excels at "
"breaking down complex topics into focused investigations.",
verbose=False,
)
researcher = Agent(
role="Research Analyst",
goal="Gather thorough, factual information with concrete data points",
backstory="You are a meticulous analyst who prioritizes accuracy "
"and always supports claims with evidence.",
verbose=False,
)
writer = Agent(
role="Technical Writer",
goal="Synthesize research into clear, well-structured reports",
backstory="You are a technical writer known for turning complex "
"research into accessible, authoritative content.",
verbose=False,
)
reviewer = Agent(
role="Fact-Checking Reviewer",
goal="Validate report accuracy against source research",
backstory="You are an editor who catches unsupported claims, "
"missing context, and logical gaps.",
verbose=False,
)
# Define tasks with explicit dependencies
plan_task = Task(
description="Decompose this query into 2-4 research sub-questions: "
"{query}",
expected_output="A numbered list of specific research sub-questions",
agent=planner,
)
research_task = Task(
description="Research each sub-question from the plan thoroughly. "
"Provide factual analysis with data points.",
expected_output="Detailed findings for each sub-question",
agent=researcher,
)
write_task = Task(
description="Write a comprehensive report synthesizing all research "
"findings. Address the original query directly.",
expected_output="A well-structured technical report",
agent=writer,
)
review_task = Task(
description="Fact-check the report against the research. Verify all "
"claims are supported. Flag any gaps or errors.",
expected_output="APPROVED or specific revision feedback",
agent=reviewer,
)
crew = Crew(
agents=[planner, researcher, writer, reviewer],
tasks=[plan_task, research_task, write_task, review_task],
process=Process.sequential,
verbose=False,
)
result = crew.kickoff(inputs={"query": "Cost and latency tradeoffs..."})

The tradeoff is visible in the code. CrewAI requires no state schema, no edge wiring, no routing functions. But you also lose: checkpointing (if it fails at step 3, you restart from step 1), conditional loops (no built-in revision cycle), per-node streaming, and human-in-the-loop breakpoints.
Use CrewAI when you need a working multi-agent prototype in an afternoon. Use LangGraph when you need production-grade state management, error recovery, and observability.
Production Hardening: What the Tutorials Skip
Getting a multi-agent system to run is easy. Getting it to run reliably at 3am when your on-call engineer is asleep is another matter entirely.
Cost Control
Multi-agent systems multiply your API costs by the number of agents times the number of iterations. A four-agent pipeline with a revision loop can easily make 8-12 LLM calls per user request.
# Add token tracking to every node. Token counts come straight from
# the API response's usage field, so no tokenizer library is needed.
class BudgetExceededError(RuntimeError):
    """Raised when a single request exceeds its cost budget."""

class CostTracker:
def __init__(self, max_budget_usd: float = 0.50):
self.total_input_tokens = 0
self.total_output_tokens = 0
self.max_budget = max_budget_usd
def track(self, response) -> None:
self.total_input_tokens += response.usage.input_tokens
self.total_output_tokens += response.usage.output_tokens
@property
def estimated_cost(self) -> float:
# Claude Sonnet 4.6 pricing
input_cost = (self.total_input_tokens / 1_000_000) * 3.00
output_cost = (self.total_output_tokens / 1_000_000) * 15.00
return input_cost + output_cost
def check_budget(self) -> None:
if self.estimated_cost > self.max_budget:
raise BudgetExceededError(
f"Request cost ${self.estimated_cost:.4f} "
f"exceeds budget ${self.max_budget:.2f}"
            )

Timeout and Retry Logic
Each agent node should have independent timeout and retry logic. A researcher hitting a rate limit should not kill the entire pipeline.
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(
stop=stop_after_attempt(3),
wait=wait_exponential(multiplier=1, min=2, max=30),
)
async def resilient_llm_call(client, **kwargs):
    """LLM call with exponential backoff. Expects an async client
    (e.g. AsyncAnthropic) so the create call is awaitable."""
return await asyncio.wait_for(
client.messages.create(**kwargs),
timeout=60.0, # Per-call timeout
    )

Structured Logging for Debugging
When four agents are collaborating, "something went wrong" is useless. You need to know which agent, at which step, with what input, produced what output.
import structlog
logger = structlog.get_logger()
def researcher_node(state: ResearchState) -> dict:
logger.info(
"researcher.start",
sub_tasks=len(state["plan"]),
revision_count=state.get("revision_count", 0),
)
# ... LLM call ...
logger.info(
"researcher.complete",
findings_count=len(findings),
total_chars=sum(len(f) for f in findings),
)
    return {"research_results": findings, "status": "researched"}

The Production Architecture
The full system combines the layers above: a budget guard wrapping every LLM call, independent per-node retry with exponential backoff and timeouts, and structured logs tagged with the agent name and pipeline stage so failures are attributable.
Common Mistakes and How to Avoid Them
After building and debugging multi-agent systems, these are the failure modes I see repeatedly:
| Mistake | What Happens | Fix |
|---|---|---|
| No revision limit | Reviewer and writer loop forever, burning tokens | Hard cap at 3 revisions with accept-and-warn |
| Shared system prompts | Agents compromise on everything, excel at nothing | Each agent gets a focused, single-purpose prompt |
| Passing full context between agents | Later agents drown in irrelevant information | Pass structured summaries, not raw conversation history |
| No cost tracking | A runaway loop costs $50 before anyone notices | Budget guard that kills the pipeline at a threshold |
| Synchronous everything | 4 research sub-tasks run sequentially when they could parallelize | Use asyncio.gather() for independent sub-tasks within a node |
| Testing agents end-to-end only | Failures are impossible to diagnose | Unit test each node in isolation with fixed state inputs |
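The parallelization fix from the table is worth seeing concretely. A sketch using asyncio.gather, with a stub standing in for the actual async LLM call (the function names here are illustrative):

```python
import asyncio

async def research_one(sub_task: str) -> str:
    """Stand-in for one research LLM call; in production this would be
    an async client call wrapped in timeout and retry logic."""
    await asyncio.sleep(0)  # yield to the event loop
    return f"findings: {sub_task}"

async def research_all(plan: list[str]) -> list[str]:
    """Independent sub-tasks run concurrently; gather preserves order,
    so results still line up with the plan."""
    return await asyncio.gather(*(research_one(t) for t in plan))

findings = asyncio.run(research_all(["cost", "latency"]))
```

With four sub-tasks at ~10 seconds each, this turns a 40-second sequential researcher into a ~10-second one, since the calls overlap on the wire.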
When You Don't Need Multi-Agent
Multi-agent systems are not always the answer. Use a single agent when:
- The task has one clear objective and one tool set
- Context window requirements are under 50% of the model's limit
- Latency matters more than quality (multi-agent adds 3-10x latency)
- Your budget is tight (multi-agent multiplies costs by agent count)
The right question is not "should I use multi-agent?" but "does my task require different expertise at different stages?" If the answer is yes, you need multiple agents. If the answer is "it's complicated but one agent with good prompting could handle it," start with one agent and split later when you hit quality ceilings.
What's Next
Multi-agent architectures are moving fast. Three developments to watch:
- LangGraph's remote graph execution — deploy individual agent nodes as separate services that scale independently. The planner runs on a small instance, the researcher scales horizontally.
- MCP as the agent communication layer — Model Context Protocol is becoming the standard for giving agents access to external tools. Instead of hardcoding tools per agent, agents discover capabilities via MCP servers.
- Persistent agent memory — agents that remember across sessions, not just within a single pipeline run. LangGraph's checkpointing already enables this; expect tighter framework-level support in 2026 H2.
The shift from single agents to multi-agent systems is the same shift that happened from monolithic to microservice architectures. The same tradeoffs apply: more power, more complexity, more failure modes to handle. But for compound AI tasks that need real quality, there is no going back.
References and Further Reading
- LangGraph Documentation — Graph API Overview — Official LangGraph docs covering StateGraph, nodes, edges, and conditional routing.
- LangGraph GitHub Repository — Source code and examples for building resilient language agents as graphs.
- AI Agent Trends 2026 Report — Google Cloud — Google's analysis of the agentic AI landscape and multi-agent adoption trends.
- Best Multi-Agent Frameworks in 2026: LangGraph, CrewAI, and More — Framework comparison covering production readiness, streaming, and state persistence.
- 2026 AI Agent Framework Showdown: Claude Agent SDK vs Strands vs LangGraph vs OpenAI Agents SDK — Detailed benchmarking of the major agent frameworks.
- The Impact of AI on Software Engineers in 2026 — Pragmatic Engineer's analysis of how agentic AI is reshaping development workflows.
- What's Next in AI: 7 Trends to Watch in 2026 — Microsoft — Microsoft's outlook on multi-agent systems, persistent agents, and repository intelligence.
- 10 Things That Matter in AI Right Now — MIT Technology Review — MIT Tech Review's roundup of the most significant AI developments in 2026.