AIStackInsights

Practical AI insights — LLMs, machine learning, prompt engineering, and the tools shaping the future.

© 2026 AIStackInsights. All rights reserved.

AI Tools

Context Engineering: The Developer Skill That Turns AI from a Chatbot into a Colleague

Prompt engineering was the skill of 2023. Context engineering is the discipline of 2026 — and it's the difference between AI that impresses in demos and AI that ships in production.

AIStackInsights Team · March 30, 2026 · 10 min read
ai-tools · prompt-engineering · tutorials · llms

There is a version of AI-assisted development that everyone has experienced: you paste some code into a chat window, get back something that almost works, paste the error, get a fix, paste the next error, and so on. It's useful. It's also exhausting.

Then there is a version where the AI opens your codebase, reads your schema, checks your existing conventions, looks up the relevant API docs, and produces code you can merge without thinking hard. That version ships features. It finds bugs before you do. It writes tests that actually test the right things.

The difference is not the model. It is the context.

Context engineering — the discipline of deliberately shaping what information an AI model receives, when it receives it, and how it is structured — is the highest-leverage skill for developers building with AI in 2026. It is what separates teams shipping 10x faster from teams stuck in the paste-fix-paste loop.

What Context Engineering Actually Means

Prompt engineering taught developers that how you phrase a question matters. Context engineering goes further: it is about what information is present in the model's window at the moment it needs to reason.

This includes:

  • The system prompt and its architecture
  • Retrieved documents, code snippets, and schema definitions
  • Tool call results injected mid-conversation
  • Conversation history — what to keep, compress, or drop
  • External memory surfaced at the right moment
  • The order and position of information (because models are not uniformly attentive across their context window)

The "Lost in the Middle" problem: Research from Stanford (Liu et al., 2023) showed that LLMs consistently perform worse on information placed in the middle of long contexts compared to the beginning or end. Context engineering includes deciding where to place critical information, not just whether to include it.

The context window is prime real estate. Every token you put in it is a decision. Context engineering is the discipline of making those decisions well.
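The placement rule follows directly from the "lost in the middle" finding and can be sketched as a small helper. This is a toy illustration; `order_for_attention` and the item names are hypothetical, not part of any library:

```python
def order_for_attention(critical, supporting):
    """Place critical items at the edges of the window, supporting material in the middle.

    Mitigates the "lost in the middle" effect: models attend best to the
    start and end of long contexts (Liu et al., 2023).
    """
    if len(critical) < 2:
        return critical + supporting
    # First critical item opens the window, last critical item closes it;
    # everything else sits in the lower-attention middle.
    return [critical[0]] + critical[1:-1] + supporting + [critical[-1]]

ordered = order_for_attention(
    critical=["team coding rules", "the task"],
    supporting=["retrieved doc 1", "retrieved doc 2"],
)
```

The instructions you cannot afford to have ignored go first and last; retrieved bulk goes in between.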

The Three Layers of Context

Think of context as three stacked layers, each with different tools and tradeoffs:

Layer | What it is | Tools | Latency
In-context | Everything in the active window | System prompts, retrieved chunks, tool outputs | Zero
External retrieval | Fetched on demand from stores | RAG, MCP servers, vector DBs | ~50–300ms
Persistent memory | Stored across sessions | MemGPT/Letta, Zep, custom stores | ~100–500ms

A well-designed AI development tool uses all three. The system prompt carries stable instructions and personas. Retrieval pulls the relevant code or docs for the current task. Persistent memory remembers that your team always uses PostgreSQL and never Redux.
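How the three layers combine under a token budget can be sketched as a context assembler. A minimal illustration only: `build_context` and every name in it are hypothetical, the whitespace split is a crude stand-in for a real tokenizer, and retrieved chunks are assumed to arrive sorted most-relevant-first:

```python
def build_context(system_prompt, memory_facts, retrieved_chunks, budget_tokens=4000):
    """Assemble the three layers into one window, trimming retrieval when over budget."""
    def tokens(text):
        return len(text.split())  # crude proxy for a real tokenizer

    # System prompt and persistent memory are non-negotiable; retrieval fills
    # whatever budget remains, in relevance order.
    remaining = budget_tokens - tokens(system_prompt) - sum(tokens(f) for f in memory_facts)
    kept = []
    for chunk in retrieved_chunks:
        cost = tokens(chunk)
        if cost > remaining:
            break
        kept.append(chunk)
        remaining -= cost
    return "\n\n".join([system_prompt, *memory_facts, *kept])
```

The design point: when the window fills, it is retrieval that gets trimmed, never the stable instructions or the facts in persistent memory.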

Tool 1: MCP Servers — Structured On-Demand Context

The Model Context Protocol (MCP), open-sourced by Anthropic and now supported by OpenAI, Cursor, Windsurf, and dozens of tools, is the most important infrastructure piece of context engineering today.

Instead of stuffing everything into the system prompt upfront (expensive, often irrelevant), an MCP server exposes tools that the model can call to fetch exactly what it needs, when it needs it.

# A minimal MCP server that gives an AI your database schema on demand
from mcp.server import Server
import mcp.types as types
 
server = Server("codebase-context")
 
@server.list_tools()
async def handle_list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="get_schema",
            description="Fetch the current database schema for a given table",
            inputSchema={
                "type": "object",
                "properties": {
                    "table_name": {"type": "string", "description": "Table to inspect"}
                },
                "required": ["table_name"]
            }
        )
    ]
 
@server.call_tool()
async def handle_call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "get_schema":
        table = arguments["table_name"]
        schema = get_prisma_schema(table)  # your implementation
        return [types.TextContent(type="text", text=schema)]
    raise ValueError(f"Unknown tool: {name}")

The payoff: instead of pasting your 800-line Prisma schema into every prompt, the model fetches the one table it needs. Context stays small. Relevance stays high. Cost drops.

Build MCP servers for your internal tools first. Your ticket tracker, your internal docs, your deployment logs. These are exactly the sources of context that make AI responses go from generic to genuinely useful for your specific codebase. See the companion scripts for a ready-to-run MCP server template.

Tool 2: RAG Pipelines — Semantic Context Retrieval

Retrieval-Augmented Generation (RAG) is the practice of embedding your documents or codebase, then at query time retrieving the most semantically relevant chunks to inject into the model's window.

For developers, the killer use case is codebase-aware assistance. Rather than hoping the model knows your internal API, you index your source files and inject the relevant ones at call time:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.retrievers import VectorIndexRetriever
import anthropic
 
# One-time: index your codebase
documents = SimpleDirectoryReader("./src").load_data()
index = VectorStoreIndex.from_documents(documents)
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)
 
# At query time: inject relevant code as context
def ask_with_codebase_context(question: str) -> str:
    nodes = retriever.retrieve(question)
    context_chunks = "\n\n".join([n.text for n in nodes])
 
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=2048,
        system=f"""You are a senior engineer on this codebase.
        
RELEVANT CODE:
{context_chunks}
 
Answer based on the actual code above.""",
        messages=[{"role": "user", "content": question}]
    )
    return message.content[0].text

RAG is not new, but its integration patterns are maturing fast. The current best practice is hybrid retrieval: semantic similarity search plus keyword (BM25) search, merged via Reciprocal Rank Fusion. Purely semantic search misses exact identifiers like getUserById; purely keyword search misses conceptual matches. Hybrid gets both.
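The fusion step itself is a few lines. A sketch of the merge, assuming each ranker returns a list of document IDs ordered best-first; `k=60` is the constant commonly used in the RRF literature:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one consensus ranking.

    Each document scores sum(1 / (k + rank)) over every ranker that returned
    it, so appearing in multiple lists beats ranking high in just one.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: semantic search found files by meaning,
# keyword (BM25) search found them by exact identifier.
semantic = ["auth_helpers.py", "session.py", "tokens.py"]
keyword = ["session.py", "getUserById.ts", "auth_helpers.py"]
merged = reciprocal_rank_fusion([semantic, keyword])
```

A file that both rankers surface, even mid-list, outranks one that only a single ranker loved, which is exactly the behavior you want for codebases full of exact identifiers.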

Tool 3: Memory Layers — Context That Persists

Single-session context is limiting. A coding agent that forgets your architectural decisions between sessions is a tool you have to babysit.

Memory-augmented agents like those built on Letta (formerly MemGPT) or Zep maintain a tiered memory system:

  • Core memory: Always in-context. Your agent's "working knowledge" — your name, your stack, your current project.
  • Archival memory: Retrieved on demand. Past decisions, resolved bugs, architectural notes.
  • Conversation history: Compressed summaries that replace raw transcripts after a session.
from letta import create_client
 
client = create_client()
 
agent = client.create_agent(
    name="dev-assistant",
    memory=client.create_memory(
        persona="You are a senior engineer who knows this codebase deeply.",
        human="Michael, full-stack developer, uses TypeScript + PostgreSQL + React Native."
    )
)
 
# This agent will remember cross-session what Michael told it
response = client.send_message(
    agent_id=agent.id,
    message="We always use Prisma for DB access, never raw SQL.",
    role="user"
)

Over time, the agent accumulates project-specific knowledge that would take a new human engineer weeks to absorb.
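The conversation-history tier can be sketched independently of any framework. A toy version, where `summarize` stands in for an LLM summarization call and all names are illustrative:

```python
def compress_history(turns, keep_last=4, summarize=None):
    """Replace older turns with one summary message; keep recent turns verbatim.

    `turns` is a list of {"role": ..., "content": ...} dicts; `summarize`
    would be an LLM call in practice.
    """
    if len(turns) <= keep_last:
        return list(turns)
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarize(old) if summarize else f"[summary of {len(old)} earlier turns]"
    # The summary occupies one message slot instead of the full transcript.
    return [{"role": "system", "content": summary}] + recent
```

Raw transcripts grow linearly; a rolling summary keeps the window bounded while preserving the decisions that matter.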

What Context Engineering Eliminates

The hallucination you're fighting is usually a context problem. When an AI confidently uses the wrong API endpoint, invents a function that doesn't exist, or ignores your team's conventions — it is almost never because the model is fundamentally incapable. It is because it lacked the information it needed. Context engineering is often the most effective anti-hallucination strategy.

Here is what well-engineered context makes unnecessary:

Manual task | Replaced by
Pasting error logs into chat | Tool call injects live logs automatically
Explaining your schema every session | MCP server exposes schema on demand
Re-teaching conventions each prompt | Persistent memory retains them
Hunting for the right file to reference | RAG surfaces it semantically
Long "background context" paragraphs | Structured system prompt + retrieval
Correcting wrong library versions | Tool fetches current package.json

Anthropic's own engineering team documented this in their Building Effective Agents guide: the teams with the best results weren't using more powerful models or more complex orchestration — they were feeding better information.

What It Opens Up

The flip side: when your AI agents have reliable access to the right context, entirely new development patterns become viable.

Autonomous code review that knows your standards. An agent with your team's style guide indexed and your recent PR history in memory can review a PR with the depth of a senior engineer who has been on the project for six months.

Self-updating documentation. An agent that can read your codebase via MCP, compare it against your docs, and flag (or fix) inconsistencies — run on every merge.

Codebase Q&A for non-engineers. Product managers and designers asking "does the app currently support multi-currency?" and getting an accurate answer, sourced from the actual code, not someone's memory of it.

Solo developers operating at team scale. With the right context infrastructure, a single developer can maintain a codebase the size of a small team's output — because the AI handles the load that used to require headcount.

The Discipline Shift

Prompt engineering asked: how do I phrase this better?

Context engineering asks: what does this model need to know, where does that information live, how do I get it there reliably, and how do I keep the window from filling up with noise?

It is less about clever wording and more about information architecture. The mental model is closer to designing a good database schema than writing a good essay. You are structuring information so that the right retrieval happens automatically.

The practical starting point is simple: audit your last ten AI interactions that produced poor results. In most cases, you'll find the model was missing a specific piece of information that you had and didn't think to provide. Context engineering is building systems that provide it automatically.

Getting Started: A Practical Checklist

  1. Build one MCP server for your most-reached-for internal tool (schema, logs, tickets)
  2. Index your codebase with LlamaIndex or LangChain — even a simple vector store beats nothing
  3. Audit your system prompts — move generic instructions out, make them specific to your actual stack
  4. Add a memory layer to any agent that crosses session boundaries
  5. Use hybrid retrieval (semantic + BM25) for codebases with lots of identifiers
  6. Put critical instructions at the beginning or end of context, not buried in the middle

The companion scripts for this article — an MCP server template, a hybrid RAG pipeline, and a Letta memory agent starter — are available at github.com/aistackinsights/stackinsights.

Sources & Further Reading

  1. Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172
  2. Anthropic Engineering. (2024). Building Effective Agents
  3. Model Context Protocol — Introduction. Anthropic, 2024
  4. Packer, C., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560
  5. Hsieh, C., et al. (2024). RULER: What's the Real Context Size of Your Long-Context Language Models?. arXiv:2404.06654
  6. LlamaIndex Documentation — RAG Pipeline. LlamaIndex, 2024
  7. LangChain RAG — How To. LangChain, 2024
  8. Zep — Memory for AI. Zep AI, 2024
  9. Letta (MemGPT) — Open-Source Memory for Agents. Letta AI, 2024
  10. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401
  11. Karpathy, A. (2025). Software 3.0. Twitter/X
  12. Willison, S. (2024). Everything I know about context windows. Simon Willison's Weblog
