AIStackInsights

Practical AI insights — LLMs, machine learning, prompt engineering, and the tools shaping the future.

© 2026 AIStackInsights. All rights reserved.

AI Tools

Context Engineering: The Developer Skill That Turns AI from a Chatbot into a Colleague

Prompt engineering was the skill of 2023. Context engineering is the discipline of 2026 — and it's the difference between AI that impresses in demos and AI that ships in production.

AIStackInsights Team · March 30, 2026 · 10 min read
ai-tools · prompt-engineering · tutorials · llms

There is a version of AI-assisted development that everyone has experienced: you paste some code into a chat window, get back something that almost works, paste the error, get a fix, paste the next error, and so on. It's useful. It's also exhausting.

Then there is a version where the AI opens your codebase, reads your schema, checks your existing conventions, looks up the relevant API docs, and produces code you can merge without thinking hard. That version ships features. It finds bugs before you do. It writes tests that actually test the right things.

The difference is not the model. It is the context.

Context engineering — the discipline of deliberately shaping what information an AI model receives, when it receives it, and how it is structured — is the highest-leverage skill for developers building with AI in 2026. It is what separates teams shipping 10x faster from teams stuck in the paste-fix-paste loop.

What Context Engineering Actually Means

Prompt engineering taught developers that how you phrase a question matters. Context engineering goes further: it is about what information is present in the model's window at the moment it needs to reason.

This includes:

  • The system prompt and its architecture
  • Retrieved documents, code snippets, and schema definitions
  • Tool call results injected mid-conversation
  • Conversation history — what to keep, compress, or drop
  • External memory surfaced at the right moment
  • The order and position of information (because models are not uniformly attentive across their context window)

The "Lost in the Middle" problem: Research from Stanford (Liu et al., 2023) showed that LLMs consistently perform worse on information placed in the middle of long contexts compared to the beginning or end. Context engineering includes deciding where to place critical information, not just whether to include it.

The context window is prime real estate. Every token you put in it is a decision. Context engineering is the discipline of making those decisions well.
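The placement rule follows directly from the "lost in the middle" finding and can be sketched as a small helper. This is a toy illustration; `order_for_attention` and the item names are hypothetical, not part of any library:

```python
def order_for_attention(critical, supporting):
    """Place critical items at the edges of the window, supporting material in the middle.

    Mitigates the "lost in the middle" effect: models attend best to the
    start and end of long contexts (Liu et al., 2023).
    """
    if len(critical) < 2:
        return critical + supporting
    # First critical item opens the window, last critical item closes it;
    # everything else sits in the lower-attention middle.
    return [critical[0]] + critical[1:-1] + supporting + [critical[-1]]

ordered = order_for_attention(
    critical=["team coding rules", "the task"],
    supporting=["retrieved doc 1", "retrieved doc 2"],
)
```

The instructions you cannot afford to have ignored go first and last; retrieved bulk goes in between.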

The Three Layers of Context

Think of context as three stacked layers, each with different tools and tradeoffs:

Layer | What it is | Tools | Latency
In-context | Everything in the active window | System prompts, retrieved chunks, tool outputs | Zero
External retrieval | Fetched on demand from stores | RAG, MCP servers, vector DBs | ~50–300ms
Persistent memory | Stored across sessions | MemGPT/Letta, Zep, custom stores | ~100–500ms

A well-designed AI development tool uses all three. The system prompt carries stable instructions and personas. Retrieval pulls the relevant code or docs for the current task. Persistent memory remembers that your team always uses PostgreSQL and never Redux.
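How the three layers combine under a token budget can be sketched as a context assembler. A minimal illustration only: `build_context` and every name in it are hypothetical, the whitespace split is a crude stand-in for a real tokenizer, and retrieved chunks are assumed to arrive sorted most-relevant-first:

```python
def build_context(system_prompt, memory_facts, retrieved_chunks, budget_tokens=4000):
    """Assemble the three layers into one window, trimming retrieval when over budget."""
    def tokens(text):
        return len(text.split())  # crude proxy for a real tokenizer

    # System prompt and persistent memory are non-negotiable; retrieval fills
    # whatever budget remains, in relevance order.
    remaining = budget_tokens - tokens(system_prompt) - sum(tokens(f) for f in memory_facts)
    kept = []
    for chunk in retrieved_chunks:
        cost = tokens(chunk)
        if cost > remaining:
            break
        kept.append(chunk)
        remaining -= cost
    return "\n\n".join([system_prompt, *memory_facts, *kept])
```

The design point: when the window fills, it is retrieval that gets trimmed, never the stable instructions or the facts in persistent memory.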

Tool 1: MCP Servers — Structured On-Demand Context

The Model Context Protocol (MCP), open-sourced by Anthropic and now supported by OpenAI, Cursor, Windsurf, and dozens of tools, is the most important infrastructure piece of context engineering today.

Instead of stuffing everything into the system prompt upfront (expensive, often irrelevant), an MCP server exposes tools that the model can call to fetch exactly what it needs, when it needs it.

# A minimal MCP server that gives an AI your database schema on demand
from mcp.server import Server
import mcp.types as types
 
server = Server("codebase-context")
 
@server.list_tools()
async def handle_list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="get_schema",
            description="Fetch the current database schema for a given table",
            inputSchema={
                "type": "object",
                "properties": {
                    "table_name": {"type": "string", "description": "Table to inspect"}
                },
                "required": ["table_name"]
            }
        )
    ]
 
@server.call_tool()
async def handle_call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "get_schema":
        table = arguments["table_name"]
        schema = get_prisma_schema(table)  # your implementation
        return [types.TextContent(type="text", text=schema)]
    raise ValueError(f"Unknown tool: {name}")

The payoff: instead of pasting your 800-line Prisma schema into every prompt, the model fetches the one table it needs. Context stays small. Relevance stays high. Cost drops.

Build MCP servers for your internal tools first. Your ticket tracker, your internal docs, your deployment logs. These are exactly the sources of context that make AI responses go from generic to genuinely useful for your specific codebase. See the companion scripts for a ready-to-run MCP server template.

Tool 2: RAG Pipelines — Semantic Context Retrieval

Retrieval-Augmented Generation (RAG) is the practice of embedding your documents or codebase, then at query time retrieving the most semantically relevant chunks to inject into the model's window.

For developers, the killer use case is codebase-aware assistance. Rather than hoping the model knows your internal API, you index your source files and inject the relevant ones at call time:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.retrievers import VectorIndexRetriever
import anthropic
 
# One-time: index your codebase
documents = SimpleDirectoryReader("./src").load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
 
# At query time: inject relevant code as context
def ask_with_codebase_context(question: str) -> str:
    nodes = retriever.retrieve(question)
    context_chunks = "\n\n".join([n.text for n in nodes])
 
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        system=f"""You are a senior engineer on this codebase.
        
RELEVANT CODE:
{context_chunks}
 
Answer based on the actual code above.""",
        messages=[{"role": "user", "content": question}]
    )
    return message.content[0].text

RAG is not new, but its integration patterns are maturing fast. The current best practice is hybrid retrieval: semantic similarity search plus keyword (BM25) search, merged via Reciprocal Rank Fusion. Purely semantic search misses exact identifiers like getUserById; purely keyword search misses conceptual matches. Hybrid gets both.
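The fusion step itself is a few lines. A sketch of the merge, assuming each ranker returns a list of document IDs ordered best-first; `k=60` is the constant commonly used in the RRF literature:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one consensus ranking.

    Each document scores sum(1 / (k + rank)) over every ranker that returned
    it, so appearing in multiple lists beats ranking high in just one.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: semantic search found files by meaning,
# keyword (BM25) search found them by exact identifier.
semantic = ["auth_helpers.py", "session.py", "tokens.py"]
keyword = ["session.py", "getUserById.ts", "auth_helpers.py"]
merged = reciprocal_rank_fusion([semantic, keyword])
```

A file that both rankers surface, even mid-list, outranks one that only a single ranker loved, which is exactly the behavior you want for codebases full of exact identifiers.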

Tool 3: Memory Layers — Context That Persists

Single-session context is limiting. A coding agent that forgets your architectural decisions between sessions is a tool you have to babysit.

Memory-augmented agents like those built on Letta (formerly MemGPT) or Zep maintain a tiered memory system:

  • Core memory: Always in-context. Your agent's "working knowledge" — your name, your stack, your current project.
  • Archival memory: Retrieved on demand. Past decisions, resolved bugs, architectural notes.
  • Conversation history: Compressed summaries that replace raw transcripts after a session.
from letta import create_client
 
client = create_client()
 
agent = client.create_agent(
    name="dev-assistant",
    memory=client.create_memory(
        persona="You are a senior engineer who knows this codebase deeply.",
        human="Michael, full-stack developer, uses TypeScript + PostgreSQL + React Native."
    )
)
 
# This agent will remember cross-session what Michael told it
response = client.send_message(
    agent_id=agent.id,
    message="We always use Prisma for DB access, never raw SQL.",
    role="user"
)

Over time, the agent accumulates project-specific knowledge that would take a new human engineer weeks to absorb.
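The conversation-history tier can be sketched independently of any framework. A toy version, where `summarize` stands in for an LLM summarization call and all names are illustrative:

```python
def compress_history(turns, keep_last=4, summarize=None):
    """Replace older turns with one summary message; keep recent turns verbatim.

    `turns` is a list of {"role": ..., "content": ...} dicts; `summarize`
    would be an LLM call in practice.
    """
    if len(turns) <= keep_last:
        return list(turns)
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarize(old) if summarize else f"[summary of {len(old)} earlier turns]"
    # The summary occupies one message slot instead of the full transcript.
    return [{"role": "system", "content": summary}] + recent
```

Raw transcripts grow linearly; a rolling summary keeps the window bounded while preserving the decisions that matter.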

What Context Engineering Eliminates

The hallucination you're fighting is usually a context problem. When an AI confidently uses the wrong API endpoint, invents a function that doesn't exist, or ignores your team's conventions — it is almost never because the model is fundamentally incapable. It is because it lacked the information it needed. Context engineering is often the most effective anti-hallucination strategy.

Here is what well-engineered context makes unnecessary:

Manual task | Replaced by
Pasting error logs into chat | Tool call injects live logs automatically
Explaining your schema every session | MCP server exposes schema on demand
Re-teaching conventions each prompt | Persistent memory retains them
Hunting for the right file to reference | RAG surfaces it semantically
Long "background context" paragraphs | Structured system prompt + retrieval
Correcting wrong library versions | Tool fetches current package.json

Anthropic's own engineering team documented this in their Building Effective Agents guide: the teams with the best results weren't using more powerful models or more complex orchestration — they were feeding better information.

What It Opens Up

The flip side: when your AI agents have reliable access to the right context, entirely new development patterns become viable.

Autonomous code review that knows your standards. An agent with your team's style guide indexed and your recent PR history in memory can review a PR with the depth of a senior engineer who has been on the project for six months.

Self-updating documentation. An agent that can read your codebase via MCP, compare it against your docs, and flag (or fix) inconsistencies — run on every merge.

Codebase Q&A for non-engineers. Product managers and designers asking "does the app currently support multi-currency?" and getting an accurate answer, sourced from the actual code, not someone's memory of it.

Solo developers operating at team scale. With the right context infrastructure, a single developer can maintain a codebase the size of a small team's output — because the AI handles the load that used to require headcount.

The Discipline Shift

Prompt engineering asked: how do I phrase this better?

Context engineering asks: what does this model need to know, where does that information live, how do I get it there reliably, and how do I keep the window from filling up with noise?

It is less about clever wording and more about information architecture. The mental model is closer to designing a good database schema than writing a good essay. You are structuring information so that the right retrieval happens automatically.

The practical starting point is simple: audit your last ten AI interactions that produced poor results. In most cases, you'll find the model was missing a specific piece of information that you had and didn't think to provide. Context engineering is building systems that provide it automatically.

Getting Started: A Practical Checklist

  1. Build one MCP server for your most-reached-for internal tool (schema, logs, tickets)
  2. Index your codebase with LlamaIndex or LangChain — even a simple vector store beats nothing
  3. Audit your system prompts — move generic instructions out, make them specific to your actual stack
  4. Add a memory layer to any agent that crosses session boundaries
  5. Use hybrid retrieval (semantic + BM25) for codebases with lots of identifiers
  6. Put critical instructions at the beginning or end of context, not buried in the middle

The companion scripts for this article — an MCP server template, a hybrid RAG pipeline, and a Letta memory agent starter — are available at github.com/aistackinsights/stackinsights.

Sources & Further Reading

  1. Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172
  2. Anthropic Engineering. (2024). Building Effective Agents
  3. Model Context Protocol — Introduction. Anthropic, 2024
  4. Packer, C., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560
  5. Hsieh, C., et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models?. arXiv:2404.06654
  6. LlamaIndex Documentation — RAG Pipeline. LlamaIndex, 2024
  7. LangChain RAG — How To. LangChain, 2024
  8. Zep — Memory for AI. Zep AI, 2024
  9. Letta (MemGPT) — Open-Source Memory for Agents. Letta AI, 2024
  10. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
  11. Karpathy, A. (2025). Software 3.0. Twitter/X
  12. Willison, S. (2024). Everything I know about context windows. Simon Willison's Weblog
