
Tutorials

Event-Sourced AI Agents: The Production Blueprint for 2026

Most AI agents fail in production because they are not replayable, testable, or safe. Learn an event-sourced architecture that gives your agents deterministic behavior, cost control, and enterprise-grade reliability.

AIStackInsights Team · March 17, 2026 · 7 min read
ai-agents · event-sourcing · architecture · production · reliability · mcp

Everyone is shipping agents. Very few teams are shipping reliable agents.

The pattern is familiar: a promising demo reaches production, tool calls become flaky, prompts drift, costs spike, and debugging turns into archaeology across logs, traces, and screenshots. At that point, your "smart assistant" is really a nondeterministic workflow engine with expensive side effects.

If you are already using open protocols for tool integration, this architecture pairs well with MCP implementation patterns.

If you want to build an agent system that can survive real traffic, audits, and pager duty, stop treating agent behavior as a black box. Treat it as a state machine driven by immutable events.

This article lays out a practical architecture that developers can adopt now: an event-sourced agent runtime with deterministic replays, policy gates, and measurable quality loops.

Architecture Flow
 
User -> Gateway API -> Orchestrator -> Event Store
                              |-> Planner
                              |-> Policy Guard -> Tool Executor
                              |-> Memory Service
                              |-> Response Builder -> User
 
Offline Evaluator -> Event Store -> Dashboards/Alerts

Why Most Agent Architectures Break

Typical agent stacks optimize for speed of prototyping:

  1. Prompt + tool schema
  2. Let model choose tool calls
  3. Append outputs to context
  4. Return answer

This works for prototypes, but in production it fails on four dimensions:

  • No replayability: you cannot reproduce a run to explain why it produced a given answer.
  • No policy boundary: tool actions can bypass risk rules.
  • No cost controls: token and tool usage drift with prompt changes.
  • No quality loop: there is no deterministic offline evaluation against historical runs.

The fix is architectural, not just "better prompting."

The Core Idea: Event-Sourced Agent State

In event sourcing, you never store only the latest state. You store every meaningful event and rebuild state from that sequence.

If you want a deeper conceptual background, Martin Fowler's write-up on Event Sourcing is still the clearest reference.

For agents, each run becomes a timeline:

  • RunStarted
  • PlanGenerated
  • ToolProposed
  • ToolApproved or ToolRejected
  • ToolExecuted
  • ObservationCaptured
  • ResponseDrafted
  • ResponseFinalized
  • RunCompleted

With this model, you gain:

  • Deterministic replay in staging.
  • Postmortems with exact causal history.
  • Fine-grained metrics by event type.
  • Auditable policy decisions.
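The timeline above can be sketched as a discriminated union in TypeScript. The payload fields below are illustrative assumptions, not a fixed schema; the small fold at the end shows the core event-sourcing move of rebuilding state from the sequence rather than reading a mutable row:

```typescript
// Discriminated union over the run timeline; payload shapes are illustrative.
type AgentEvent =
  | { type: "RunStarted"; runId: string; userId: string }
  | { type: "PlanGenerated"; runId: string; plan: unknown }
  | { type: "ToolProposed"; runId: string; stepId: string; tool: string }
  | { type: "ToolApproved"; runId: string; stepId: string }
  | { type: "ToolRejected"; runId: string; stepId: string; rule: string }
  | { type: "ToolExecuted"; runId: string; stepId: string; durationMs: number }
  | { type: "ObservationCaptured"; runId: string; stepId: string; observation: unknown }
  | { type: "ResponseDrafted"; runId: string; draft: string }
  | { type: "ResponseFinalized"; runId: string; response: string }
  | { type: "RunCompleted"; runId: string; status: "success" | "fallback" | "error" };

// State is rebuilt by folding over events in order, never by mutating a stored row.
function replayStatus(events: AgentEvent[]): string {
  let status = "running";
  for (const e of events) {
    if (e.type === "RunCompleted") status = e.status;
  }
  return status;
}
```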

Reference Data Model

Use a minimal schema first:

create table agent_events (
  run_id text not null,
  seq bigint not null,
  event_type text not null,
  event_time timestamptz not null default now(),
  actor text not null, -- orchestrator, planner, tool_executor, policy_guard
  payload jsonb not null,
  primary key (run_id, seq)
);
 
create index idx_agent_events_type_time
  on agent_events(event_type, event_time desc);

run_id + seq gives total ordering per run. Keep ordering logic in the orchestrator to avoid races.
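A minimal in-memory sketch of that ordering rule, assuming the orchestrator is the only writer; a production store would back this with the agent_events table above:

```typescript
// Orchestrator-owned event store sketch. Because one component assigns seq,
// (runId, seq) ordering never races between services.
type StoredEvent = { runId: string; seq: number; eventType: string; payload: unknown };

class EventStore {
  private events: StoredEvent[] = [];
  private nextSeq = new Map<string, number>();

  append(runId: string, eventType: string, payload: unknown): StoredEvent {
    const seq = this.nextSeq.get(runId) ?? 1;
    this.nextSeq.set(runId, seq + 1);
    const event = { runId, seq, eventType, payload };
    this.events.push(event); // historical events are never mutated
    return event;
  }

  replay(runId: string): StoredEvent[] {
    // (runId, seq) gives total ordering per run
    return this.events
      .filter((e) => e.runId === runId)
      .sort((a, b) => a.seq - b.seq);
  }
}
```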

Design Rule

Never mutate historical events. If business logic changes, append compensating events.

Runtime Architecture You Can Ship

A production-ready flow should separate concerns into explicit services.

1. Gateway API

Responsibilities:

  • Request validation and auth
  • Tenant isolation
  • Idempotency key handling

2. Orchestrator

Responsibilities:

  • Single source of run sequencing
  • Event persistence
  • Timeout, retry, and cancellation

3. Planner

Responsibilities:

  • Produce a bounded plan: goals, steps, required tools, stop conditions
  • Return machine-readable plan JSON, not prose

4. Policy Guard

Responsibilities:

  • Evaluate each proposed tool call against policy
  • Enforce allow/deny/require-human rules
  • Emit explicit policy events

5. Tool Executor

Responsibilities:

  • Run tools in sandboxed environment
  • Apply per-tool timeout and budget limits
  • Normalize all outputs and errors into structured observations

6. Memory Service

Responsibilities:

  • Retrieve only task-relevant context
  • Apply TTL and PII redaction
  • Keep long-term memory opt-in, not default

7. Offline Evaluator

Responsibilities:

  • Replay historical runs with new prompts/models
  • Score quality, latency, cost, and policy compliance
  • Block unsafe prompt/model rollouts

Deterministic Planning Contract

The largest reliability jump comes from forcing the planner to emit constrained JSON.

type PlanStep = {
  id: string;
  objective: string;
  tool: "search_docs" | "query_sql" | "create_ticket" | "send_email";
  input: Record<string, unknown>;
  successCriteria: string;
};
 
type AgentPlan = {
  runId: string;
  maxSteps: number;
  budgetUsd: number;
  steps: PlanStep[];
  stopConditions: string[];
};

Then validate before execution:

import { z } from "zod";
 
const planSchema = z.object({
  runId: z.string().min(8),
  maxSteps: z.number().int().min(1).max(20),
  budgetUsd: z.number().min(0.01).max(10),
  steps: z.array(
    z.object({
      id: z.string(),
      objective: z.string().min(5),
      tool: z.enum(["search_docs", "query_sql", "create_ticket", "send_email"]),
      input: z.record(z.unknown()),
      successCriteria: z.string().min(5),
    })
  ).min(1),
  stopConditions: z.array(z.string()).min(1),
});

If parsing fails, emit PlanRejected and route to safe fallback behavior.
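A sketch of that failure path. The validate and emitEvent parameters are stand-ins for your schema parser (for zod, an adapter around planSchema.safeParse) and the orchestrator's event append, so the sketch stays dependency-free:

```typescript
// Validator abstracted behind a small interface; with zod you would adapt
// planSchema.safeParse(raw) to this shape.
type ValidationResult = { success: boolean; issues?: string[] };

function acceptPlan(
  raw: unknown,
  validate: (raw: unknown) => ValidationResult,
  emitEvent: (type: string, payload: unknown) => void
): boolean {
  const result = validate(raw);
  if (!result.success) {
    // Emit PlanRejected so the failure is auditable, then route to fallback.
    emitEvent("PlanRejected", { issues: result.issues ?? [] });
    return false;
  }
  emitEvent("PlanGenerated", { plan: raw });
  return true;
}
```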

Policy-as-Code for Tool Safety

Do not bury safety in prompts. Put safety in executable policy.

Use this alongside established risk frameworks such as the OWASP Top 10 for LLM Applications.

version: 1
rules:
  - name: deny_external_email_without_approval
    when:
      tool: send_email
      input:
        recipient_domain_not_in: ["yourcompany.com"]
    action: require_human_approval
 
  - name: deny_destructive_sql
    when:
      tool: query_sql
      input:
        sql_matches_regex: "(?i)\\b(delete|drop|truncate|alter)\\b"
    action: deny
 
  - name: cap_ticket_creation_rate
    when:
      tool: create_ticket
      run_metrics:
        tickets_created_gt: 3
    action: deny

This makes behavior explainable in security reviews and incident analysis.
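The YAML rules above could be evaluated by something like the following sketch. The matching is deliberately simplified, and the input field names (sql, recipient) are assumptions; a real engine would compile the YAML rather than hand-code predicates:

```typescript
// Tiny policy evaluator mirroring the YAML rules above.
type PolicyAction = "allow" | "deny" | "require_human_approval";
type ToolCall = { tool: string; input: Record<string, unknown> };

type PolicyRule = {
  name: string;
  matches: (call: ToolCall) => boolean;
  action: PolicyAction;
};

const rules: PolicyRule[] = [
  {
    name: "deny_destructive_sql",
    matches: (c) =>
      c.tool === "query_sql" &&
      /\b(delete|drop|truncate|alter)\b/i.test(String(c.input.sql ?? "")),
    action: "deny",
  },
  {
    name: "deny_external_email_without_approval",
    matches: (c) =>
      c.tool === "send_email" &&
      !String(c.input.recipient ?? "").endsWith("@yourcompany.com"),
    action: "require_human_approval",
  },
];

function evaluate(call: ToolCall): { action: PolicyAction; rule?: string } {
  for (const rule of rules) {
    if (rule.matches(call)) return { action: rule.action, rule: rule.name };
  }
  return { action: "allow" }; // default-allow here; production may prefer default-deny
}
```

First-match-wins keeps every decision attributable to a named rule, which is what makes the policy events auditable.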

End-to-End Execution Loop

1) User sends request to Gateway API.
2) Orchestrator starts run and appends RunStarted.
3) Planner returns bounded plan JSON.
4) Policy Guard evaluates each proposed tool call.
5) Approved calls execute in Tool Executor.
6) Observations are appended as immutable events.
7) Orchestrator finalizes response and appends RunCompleted.
8) Offline evaluator replays runs for quality/cost/policy scoring.
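The steps above, condensed into one loop. Every service here is an assumed interface injected for testability, not a real SDK; the event names match the timeline from earlier:

```typescript
// High-level run loop sketch; services are injected interfaces.
type Step = { id: string; tool: string; input: Record<string, unknown> };
type Services = {
  append: (type: string, payload: unknown) => void; // event store write
  plan: (request: string) => Step[];                // planner
  policy: (step: Step) => "allow" | "deny";         // policy guard
  execute: (step: Step) => unknown;                 // tool executor
  respond: (observations: unknown[]) => string;     // response builder
};

function runAgent(request: string, s: Services): string {
  s.append("RunStarted", { request });
  const steps = s.plan(request);
  s.append("PlanGenerated", { steps });
  const observations: unknown[] = [];
  for (const step of steps) {
    if (s.policy(step) !== "allow") {
      s.append("ToolRejected", { stepId: step.id });
      continue; // denied steps are recorded, never executed
    }
    s.append("ToolApproved", { stepId: step.id });
    const obs = s.execute(step);
    s.append("ObservationCaptured", { stepId: step.id, obs });
    observations.push(obs);
  }
  const response = s.respond(observations);
  s.append("RunCompleted", { response });
  return response;
}
```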

Observability: What to Measure Every Day

At minimum, track these metrics by route, model, and tenant:

  • Success rate: runs completed without fallback or human takeover
  • Policy violation rate: denied or escalated actions per 100 runs
  • Cost per successful run: (token spend + tool costs) divided by the number of successful runs
  • Replay delta: quality difference between old and new prompts/models
  • Time-to-first-action: latency from RunStarted to first approved tool call

For telemetry implementation, instrument spans and attributes with OpenTelemetry.

For alerting, start simple:

  • Page if policy violation rate rises above baseline by 3x.
  • Warn if cost per successful run increases more than 30% day-over-day.
  • Block deployment if replay delta quality drops below threshold.
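These thresholds are trivial to encode; the baselines are assumed inputs that your metrics pipeline would supply:

```typescript
// Alert predicates matching the starter rules above.
// Page when the policy violation rate exceeds 3x baseline.
function shouldPage(violationRate: number, baselineRate: number): boolean {
  return violationRate > baselineRate * 3;
}

// Warn when cost per successful run rises more than 30% day-over-day.
function shouldWarnOnCost(todayCost: number, yesterdayCost: number): boolean {
  return todayCost > yesterdayCost * 1.3;
}
```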

CI for Agent Changes (The Missing Discipline)

When you change prompts, tool descriptions, or model versions, run replay tests like you run unit tests.

pnpm agent:replay --dataset ./eval/runs.jsonl --candidate prompt-v42
pnpm agent:score --metrics quality,cost,latency,policy
pnpm agent:gate --min-quality 0.82 --max-cost-delta 0.15 --max-policy-regressions 0

A change should ship only if it beats baseline under agreed constraints.
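The gate flags from the commands above map to a simple predicate. The metric names and default thresholds mirror the CLI example; everything else is an assumption about what your scorer emits:

```typescript
// Ship/no-ship gate over replay scores; defaults mirror the CLI flags above.
type ReplayScore = { quality: number; costDelta: number; policyRegressions: number };

function passesGate(
  score: ReplayScore,
  gate = { minQuality: 0.82, maxCostDelta: 0.15, maxPolicyRegressions: 0 }
): boolean {
  return (
    score.quality >= gate.minQuality &&
    score.costDelta <= gate.maxCostDelta &&
    score.policyRegressions <= gate.maxPolicyRegressions
  );
}
```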

For benchmark-driven regression testing, the LangSmith evaluation docs are a useful practical reference.

Hard Truth

Most teams do more testing for button colors than for autonomous tool-calling behavior. Reverse that.

A Practical Build Plan (Two Weeks)

Week 1:

  1. Build orchestrator + event append API.
  2. Add JSON planner contract with strict validation.
  3. Integrate 2-3 read-only tools behind policy guard.
  4. Emit structured events for every transition.

Week 2:

  1. Add replay runner over historical runs.
  2. Add policy regression checks in CI.
  3. Add dashboards for quality, cost, and violations.
  4. Add human approval queue for high-risk actions.

By day 14, you have an agent system that is debuggable, auditable, and safer to scale.

Opinionated Defaults for 2026

If you are starting today, use these defaults:

  • Keep a single orchestrator per run for deterministic ordering.
  • Use a small, fast planning model, and bring in a stronger synthesis model only at the response stage.
  • Start with read-only tools; graduate to write actions with policy and approval gates.
  • Treat memory as retrieval with TTL, not permanent unbounded context.
  • Make every model or prompt change pass replay gates before production.

These defaults optimize for trust and velocity, not demo theatrics.

Final Takeaway

The breakthrough in agent development is not another clever prompt trick. It is adopting software architecture patterns that backend engineers already trust: event sourcing, policy-as-code, deterministic contracts, and replay testing.

Teams that do this will not just ship more agent features. They will ship systems developers can reason about, security can approve, and product can scale.

That is how agent engineering becomes a real discipline, not a sequence of demos.

References and Further Reading

  • Model Context Protocol: Developer Guide
  • Building Production RAG Applications
  • Event Sourcing by Martin Fowler
  • OpenTelemetry Documentation
  • OWASP Top 10 for LLM Applications
  • LinkedIn Engineering: Workflow-based AI Agent Systems