The LLM Gateway Pattern: Cut Your AI Bill 80% Without Touching a Prompt
Most LLM apps send every request to the most expensive model and re-pay for every duplicate question. The LLM Gateway pattern fixes both — with smart routing, semantic caching, and budget guards. Here is the production architecture, with code.
You shipped your first LLM feature in January. Cost: $400 a month. By March it was $4,000. This week your CFO forwarded the bill with a single word: "explain."
You open the dashboard and the answer is obvious: every request — every "what's my account balance," every "summarize this email," every duplicate question already asked five times today — is being routed to the most capable, most expensive model in your stack. You are paying GPT-5 Pro prices to answer "what time is it."
This is the most common production failure in AI engineering today, and it has nothing to do with prompts, models, or fine-tuning. It is an architecture problem. You are missing the LLM Gateway — a routing and caching layer that sits between your application and your model providers, and it is the single highest-ROI change you can make to a production AI system.
This guide shows you exactly how to build one. Real code. Real numbers. Real use cases you can ship this week.
Why Your AI Bill Is Exploding
Three structural problems compound to drive LLM costs up and to the right:
1. The Flagship-Tax Problem
Engineers default to the most capable model "to be safe." But analysis of production traffic from a dozen LLM-backed SaaS apps shows the same distribution every time: roughly 60% of queries are trivial (lookups, classifications, short rewrites), 30% are medium complexity (summarization, structured extraction, light reasoning), and only 10% genuinely need flagship reasoning (multi-step planning, novel synthesis, complex code).
Sending the trivial 60% to a flagship model is not "being safe." It is paying 18× more than necessary for output a small model would produce identically. Early in an organization's AI adoption the skew is usually even worse, which means most of the spend is simply unnecessary.
2. The Duplicate-Question Problem
Real users ask the same questions over and over. "How do I cancel my subscription?" "What's your return policy?" "Reset my password." Your customer-support copilot answered that exact question 4,200 times last week — and paid the LLM provider 4,200 times to compute the same answer. And even a simple question often arrives wrapped in pages of context, such as the full output of a failed command, which multiplies the token bill.
A naive string-match cache catches almost none of these because users phrase the same question differently. "reset password," "I forgot my password," "How do I get back into my account" — three strings, one question, three full LLM calls.
3. The Runaway-Cost Problem
A single misconfigured retry loop, a malicious user, or an agent stuck in a self-correction cycle can burn through a month's budget in twenty minutes. Without a budget guard, you find out when the invoice arrives. With one, the gateway returns 429 and pages you instead of bankrupting you.
The numbers, quantified
Across instrumented production deployments using the patterns in this guide, teams report 62-84% cost reduction with no measurable drop in user-visible quality. The largest single contributor is routing (40-55% savings), followed by semantic caching (15-30% savings on top of routing), with budget guards eliminating the long-tail spend spikes that drive most "surprise invoice" incidents.
The LLM Gateway Pattern
An LLM Gateway is a service — typically a thin FastAPI app, Cloudflare Worker, or Kong plugin — that intercepts every LLM request from your application before it reaches a model provider. It does five things:
- Cache — return a stored response if the new query is semantically equivalent to a recent one
- Classify — assess whether the query is simple, medium, or hard
- Route — pick the cheapest model that can answer it correctly
- Validate — cheaply check the response and escalate to a stronger model on low confidence
- Govern — enforce per-user budget limits, redact PII, log structured traces
Your application code never picks a model directly. It calls gateway.complete(query) and the gateway makes the decision. This single abstraction is what lets you change pricing, swap providers, A/B test models, and apply caching without touching feature code.
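What that looks like at a call site: a minimal sketch, assuming the FastAPI gateway built later in this guide is deployed at an internal URL (the URL is a placeholder; the request fields mirror the /v1/complete endpoint defined below).

import httpx

async def ask(query: str, user_id: str) -> str:
    # Feature code calls the gateway; it never names a model or provider.
    async with httpx.AsyncClient(timeout=30.0) as http:
        r = await http.post(
            "http://llm-gateway.internal/v1/complete",  # placeholder deploy URL
            json={"query": query, "user_id": user_id},
        )
        r.raise_for_status()
        return r.json()["text"]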
Pattern 1: The Routing Cascade
The routing cascade is the heart of the gateway. Try the cheapest model that could plausibly answer. Validate the answer cheaply. Escalate only if validation fails.
Building the Classifier
The classifier is the smartest cost lever in the entire gateway. Get it right and 60% of your traffic stops paying flagship prices. Get it wrong and you ship a worse product than before.
The trick: do not use the flagship model to classify. That defeats the point. Use heuristics first, then a small model second.
import re
from anthropic import Anthropic
client = Anthropic()
# Cheap, deterministic heuristics — run these first
SIMPLE_PATTERNS = [
r"^(what|when|where|who|how much|how many)\b.{0,80}\?$", # short factoid
r"^(yes|no|okay|thanks|thank you|cancel)\b", # acks
r"^/(help|status|reset|clear)\b", # commands
]
HARD_SIGNALS = [
"step by step", "plan", "strategy", "design", "architecture",
"compare and contrast", "analyze", "tradeoffs", "implications",
"write code", "implement", "refactor", "debug",
]
def fast_classify(query: str) -> str | None:
"""Return 'simple' / 'hard' / None (=unsure, ask the small model)."""
q = query.strip().lower()
if len(q) < 60 and any(re.search(p, q) for p in SIMPLE_PATTERNS):
return "simple"
if any(s in q for s in HARD_SIGNALS) or len(q) > 1500:
return "hard"
return None # let the small classifier decide
def llm_classify(query: str) -> str:
"""Tiny-model fallback when heuristics are uncertain."""
response = client.messages.create(
model="claude-haiku-4-5", # ~$1 per million tokens
max_tokens=10,
system=(
"Classify the user query into exactly one of: simple, medium, hard. "
"simple = lookup, ack, short rewrite. "
"medium = summarization, extraction, light reasoning. "
"hard = multi-step planning, novel synthesis, code generation. "
"Reply with only the single word."
),
messages=[{"role": "user", "content": query}],
)
label = response.content[0].text.strip().lower()
return label if label in {"simple", "medium", "hard"} else "medium"
def classify(query: str) -> str:
    return fast_classify(query) or llm_classify(query)

The heuristic layer answers ~70% of classification calls for free. The remaining 30% pay one Haiku call each — adding under a tenth of a cent per request to save dollars on the routing decision.
Wiring the Cascade
TIERS = {
"simple": "claude-haiku-4-5", # $0.80 / 1M input
"medium": "claude-sonnet-4-6", # $3.00 / 1M input
"hard": "claude-opus-4-7", # $15.00 / 1M input
}
ESCALATION = {"simple": "medium", "medium": "hard", "hard": None}
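# acomplete(model, system, query) is assumed throughout: your thin async wrapper
# around the provider SDK, returning an object exposing .text and .usage.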
async def route_and_complete(query: str, system: str) -> dict:
tier = classify(query)
attempts = []
while tier:
model = TIERS[tier]
response = await acomplete(model, system, query)
attempts.append({"tier": tier, "model": model, "tokens": response.usage})
if validator_passes(query, response.text, model=TIERS["simple"]):
return {"text": response.text, "attempts": attempts, "final_tier": tier}
tier = ESCALATION[tier]
# All tiers tried; return the best we got
return {"text": response.text, "attempts": attempts, "final_tier": "hard"}The Self-Check Validator
The validator is what makes the cascade safe. Without it, a small model that confidently produces a wrong answer goes uncaught. With it, the gateway can ship aggressive routing and trust the safety net.
def validator_passes(query: str, answer: str, model: str) -> bool:
"""LLM-as-judge — cheap confidence check, returns True/False."""
response = client.messages.create(
model=model,
max_tokens=20,
system=(
"You are a confidence checker. Read the user query and the proposed answer. "
"Reply with CONFIDENT if the answer fully and correctly addresses the query. "
"Reply with UNSURE if the answer is incomplete, off-topic, hedged, or might be wrong."
),
messages=[{
"role": "user",
"content": f"Query:\n{query}\n\nProposed answer:\n{answer}"
}],
)
    return response.content[0].text.strip().upper().startswith("CONFIDENT")

Tune the validator on your traffic
Run the validator in shadow mode first — log its verdicts for a week without acting on them. Compare against human review on a sample. Adjust the system prompt until false-pass rate is under 3% on your traffic. Only then enable escalation. Without this calibration step, the validator either over-escalates (defeating the cost savings) or under-escalates (shipping wrong answers).
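A minimal shadow-mode sketch, assuming the emit_metric helper used elsewhere in this guide: it records the verdict alongside live traffic without acting on it.

async def validator_shadow(query: str, answer: str, model: str) -> None:
    """Calibration mode: log the verdict, never escalate on it."""
    verdict = validator_passes(query, answer, model=model)
    await emit_metric(
        "gateway.validator.shadow",
        passed=verdict,
        query_len=len(query),
        answer_len=len(answer),
    )

Sample these logs against human review and tune the validator prompt until the false-pass rate clears your bar before flipping escalation on.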
Pattern 2: Semantic Caching
Exact-string caching catches almost nothing in production because humans phrase the same question a dozen ways. Semantic caching matches by meaning. The query is embedded; the gateway searches a vector index of recent queries; if any cached query is "close enough" by cosine similarity, the cached response is returned.
A Working Implementation with pgvector
import asyncpg
import numpy as np
from openai import AsyncOpenAI
oai = AsyncOpenAI()
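# NOTE: the asyncpg pool used below assumes the pgvector codec is registered on
# each connection (e.g. via pgvector.asyncpg.register_vector) so that Python
# lists of floats round-trip as the Postgres vector type.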
async def embed(text: str) -> list[float]:
r = await oai.embeddings.create(
model="text-embedding-3-small", # $0.02 / 1M tokens
input=text,
)
return r.data[0].embedding
async def cache_lookup(pool, query: str, threshold: float = 0.92) -> str | None:
vec = await embed(query)
# pgvector cosine distance: 0 = identical, 2 = opposite
async with pool.acquire() as conn:
row = await conn.fetchrow(
"""
SELECT response, 1 - (embedding <=> $1) AS similarity
FROM llm_cache
WHERE expires_at > NOW()
ORDER BY embedding <=> $1
LIMIT 1
""",
vec,
)
if row and row["similarity"] >= threshold:
return row["response"]
return None
async def cache_store(pool, query: str, response: str, ttl_seconds: int = 86400):
vec = await embed(query)
async with pool.acquire() as conn:
await conn.execute(
"""
INSERT INTO llm_cache (query, embedding, response, expires_at)
VALUES ($1, $2, $3, NOW() + ($4 || ' seconds')::interval)
""",
query, vec, response, str(ttl_seconds),
        )

Schema:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE llm_cache (
id BIGSERIAL PRIMARY KEY,
query TEXT NOT NULL,
embedding vector(1536) NOT NULL,
response TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX llm_cache_embedding_idx
ON llm_cache USING hnsw (embedding vector_cosine_ops);
CREATE INDEX llm_cache_expires_idx ON llm_cache (expires_at);

Threshold and TTL — the Two Knobs That Matter
The similarity threshold is the precision-recall tradeoff of caching:
- 0.97-0.99 — only near-identical paraphrases hit. Safe for legal, medical, financial answers where any drift is unacceptable. Hit rates: 10-20%.
- 0.92-0.95 — sweet spot for most product traffic. Catches paraphrases without conflating different questions. Hit rates: 35-50%.
- 0.85-0.90 — aggressive. Use for low-stakes content (marketing, summaries, rewrites). Hit rates: 50-70%, but expect occasional "close but wrong" matches.
TTL by content type (a combined policy sketch follows the table):
| Content | TTL | Why |
|---|---|---|
| Static FAQ ("what is your refund policy") | 30 days | Policy rarely changes |
| Product info | 24 hours | Catalog updates daily |
| User-specific data | Do not cache | Cross-user leakage risk |
| Time-sensitive ("latest news") | 5 minutes or skip | Stale answers are wrong |
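Both knobs collapse into one policy table the gateway can consult per request. A minimal sketch, with values taken from the guidance above; classify_content is an assumed helper (your own content tagger):

# (threshold, ttl_seconds) per content category; None means "do not cache".
CACHE_POLICY: dict[str, tuple[float, int] | None] = {
    "static_faq":     (0.94, 30 * 86400),  # policy rarely changes
    "product_info":   (0.92, 86400),       # catalog updates daily
    "user_specific":  None,                # cross-user leakage risk
    "time_sensitive": (0.97, 300),         # stale answers are wrong
    "low_stakes":     (0.88, 86400),       # marketing, summaries, rewrites
}

def cache_params(query: str) -> tuple[float, int] | None:
    category = classify_content(query)  # assumed helper: your content tagger
    return CACHE_POLICY.get(category)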
Never cache user-scoped content in a shared cache
If your cache is shared across users, you must either (a) include the user ID in the cache key/embedding, or (b) skip caching for any query that could return user-specific data. The single most expensive bug in semantic caching is showing User A the cached answer that was generated for User B. Use a per-user cache namespace or a query allowlist.
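A minimal sketch of option (a), assuming a user_id column added to the llm_cache table; shared answers are stored under a sentinel scope so FAQ hits still work across users:

async def scoped_cache_lookup(pool, user_id: str, query: str,
                              threshold: float = 0.92) -> str | None:
    vec = await embed(query)
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            """
            SELECT response, 1 - (embedding <=> $1) AS similarity
            FROM llm_cache
            WHERE expires_at > NOW()
              AND user_id IN ($2, '__global__')  -- own rows + shared rows only
            ORDER BY embedding <=> $1
            LIMIT 1
            """,
            vec, user_id,
        )
    if row and row["similarity"] >= threshold:
        return row["response"]
    return None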
Pattern 3: Budget Guards and Failover
Cost optimization without spend ceilings is a coin flip. The next runaway loop, prompt injection, or pricing change can erase a quarter of savings in a day. The gateway needs hard limits.
from datetime import datetime
from decimal import Decimal
async def check_budget(pool, user_id: str, estimated_cost: Decimal) -> bool:
"""Return True if the request fits the user's daily/monthly budget."""
async with pool.acquire() as conn:
spent_today, spent_month = await conn.fetchrow(
"""
SELECT
COALESCE(SUM(cost) FILTER (WHERE day = CURRENT_DATE), 0) AS today,
COALESCE(SUM(cost) FILTER (WHERE day >= date_trunc('month', CURRENT_DATE)), 0) AS month
FROM llm_spend WHERE user_id = $1
""",
user_id,
)
daily_limit = Decimal("5.00")
monthly_limit = Decimal("100.00")
return (
spent_today + estimated_cost <= daily_limit
and spent_month + estimated_cost <= monthly_limit
)
async def record_spend(pool, user_id: str, model: str, cost: Decimal):
async with pool.acquire() as conn:
await conn.execute(
"INSERT INTO llm_spend (user_id, day, model, cost) VALUES ($1, CURRENT_DATE, $2, $3)",
user_id, model, cost,
        )

Failover is the other half of governance. When the primary provider is down or slow, the gateway routes to a secondary — typically the same tier from a different vendor.
from tenacity import retry, stop_after_attempt, wait_exponential

class ProviderError(Exception):
    """Stand-in for your provider SDK's API-error base class."""
PROVIDER_FALLBACKS = {
"claude-haiku-4-5": ["gpt-5-mini", "gemma-4-9b-instruct"],
"claude-sonnet-4-6": ["gpt-5", "gemini-2-5-flash"],
"claude-opus-4-7": ["gpt-5-pro", "gemini-2-5-pro"],
}
@retry(stop=stop_after_attempt(2), wait=wait_exponential(min=0.2, max=2))
async def acomplete_with_failover(model: str, system: str, query: str):
try:
return await acomplete(model, system, query, timeout=8.0)
except (TimeoutError, ProviderError) as e:
for fallback in PROVIDER_FALLBACKS.get(model, []):
try:
return await acomplete(fallback, system, query, timeout=8.0)
except (TimeoutError, ProviderError):
continue
    raise e

The Production Architecture
Putting it all together, this is what a hardened LLM Gateway looks like in production.
The FastAPI Skeleton
Here is the entire request path, in 60 lines, ready to extend:
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel
from decimal import Decimal
import asyncpg
import time
app = FastAPI()
class CompleteRequest(BaseModel):
query: str
system: str | None = None
user_id: str
cache_ttl: int = 86400
cache_threshold: float = 0.92
@app.post("/v1/complete")
async def complete(req: CompleteRequest, request: Request):
pool = request.app.state.pool
started = time.perf_counter()
# 1. Budget guard (estimate ~2k tokens worst case)
if not await check_budget(pool, req.user_id, Decimal("0.05")):
raise HTTPException(429, "Daily budget exceeded")
# 2. Semantic cache lookup
if cached := await cache_lookup(pool, req.query, req.cache_threshold):
await emit_metric("gateway.cache.hit", user=req.user_id)
return {"text": cached, "cached": True, "latency_ms": ms(started)}
# 3. Classify + route + validate (cascade)
result = await route_and_complete(req.query, req.system or "")
# 4. Persist + record spend + emit traces
await cache_store(pool, req.query, result["text"], ttl_seconds=req.cache_ttl)
cost = compute_cost(result["attempts"])
    await record_spend(pool, req.user_id, TIERS[result["final_tier"]], cost)  # bill against the model actually used
await emit_trace(req, result, cost, ms(started))
return {
"text": result["text"],
"cached": False,
"tier": result["final_tier"],
"cost_usd": float(cost),
"latency_ms": ms(started),
}
def ms(t0: float) -> int:
    return int((time.perf_counter() - t0) * 1000)

Observability — the Metrics That Matter
Without metrics, you cannot tune the gateway. With them, you find the next 20% of savings every week. Track these at minimum:
| Metric | What it tells you | Action threshold |
|---|---|---|
| gateway.cache.hit_rate | Is semantic caching working? | < 25% → loosen threshold or check write-back |
| gateway.tier.distribution | Where is your spend going? | > 30% on Tier 3 → classifier is over-escalating |
| gateway.escalation.rate | How often is the cascade falling through? | > 15% → small model is not strong enough for your traffic |
| gateway.cost_per_request | The headline number | Track per route + per user segment |
| gateway.p95_latency_ms | Cascade adds latency on misses | > 2× baseline → consider parallel speculation |
| gateway.budget.rejection_rate | Are users hitting limits? | > 1% → review per-user limits |
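One of these, gateway.cost_per_request, depends on the compute_cost helper the skeleton calls but leaves undefined. A minimal sketch, assuming the attempts structure from the cascade and an Anthropic-style usage object with input_tokens/output_tokens fields; the prices are illustrative and should be synced with current provider pricing:

from decimal import Decimal

# Illustrative (input, output) prices per 1M tokens; keep in sync with provider pricing.
PRICES = {
    "claude-haiku-4-5":  (Decimal("0.80"),  Decimal("4.00")),
    "claude-sonnet-4-6": (Decimal("3.00"),  Decimal("15.00")),
    "claude-opus-4-7":   (Decimal("15.00"), Decimal("75.00")),
}

def compute_cost(attempts: list[dict]) -> Decimal:
    """Sum the cost of every cascade attempt, not just the final one."""
    total = Decimal("0")
    for a in attempts:
        in_price, out_price = PRICES[a["model"]]
        usage = a["tokens"]  # usage object captured by route_and_complete
        total += (Decimal(usage.input_tokens) * in_price
                  + Decimal(usage.output_tokens) * out_price) / Decimal(1_000_000)
    return total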
Five Real Use Cases You Can Ship This Week
The patterns above are general. These are the use cases where they pay off fastest.
Use Case 1 — Customer Support Copilot
The setup. A SaaS support widget answers user questions using your help docs as context. You were sending every query to GPT-5 Pro because some questions need careful reasoning.
The fix. Heuristic classifier sends "what's your refund policy" / "how do I reset my password" to Haiku with the relevant doc snippet. Sonnet handles "I'm seeing X error after step Y, what's wrong." Opus reserved for multi-issue tickets and angry-customer escalations.
Numbers from a real deployment (12k queries/day, mid-market SaaS): cost per query dropped from $0.018 to $0.0031 — 83% reduction. Cache hit rate on FAQ-style queries hit 47% within two weeks.
Use Case 2 — Code Review Bot for a Monorepo
The setup. A GitHub Action runs Opus on every PR diff. Costs $8,000/month and growing.
The fix. Cascade by diff size and file type. Diffs < 200 lines or touching only tests/docs go to Sonnet. Diffs touching critical paths (auth, billing, schemas) go to Opus regardless of size. Validator escalates if Sonnet flags any "I'm not sure" reasoning. Semantic cache keyed on (filename + diff) catches re-runs of the same PR.
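A sketch of that routing rule; the path patterns are illustrative, and your critical-path list will differ:

CRITICAL_PATHS = ("auth/", "billing/", "migrations/", "schema")  # illustrative

def review_tier(diff_line_count: int, changed_files: list[str]) -> str:
    # Critical paths always get the flagship, regardless of diff size.
    if any(p in f for f in changed_files for p in CRITICAL_PATHS):
        return "hard"
    # Tests/docs-only changes or small diffs go to the mid tier.
    if diff_line_count < 200 or all(
        f.startswith(("tests/", "docs/")) for f in changed_files
    ):
        return "medium"
    return "hard"  # large, non-critical diffs: default up, not down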
Result. Same precision/recall on a 200-PR holdout set. $8,000/month → $1,650/month. The validator catches the ~6% of diffs where Sonnet would have missed something subtle.
Use Case 3 — Bulk Content Generation Pipeline
The setup. Marketing platform generates product descriptions, social posts, and email subject lines. 50,000 generations/day on GPT-5.
The fix. This is where small models eat the flagships' lunch. Haiku 4.5 produces product descriptions that A/B-test identically to GPT-5 output for 90% of categories. Opus is reserved for hero copy and brand-sensitive launches. Aggressive semantic caching (threshold 0.88) on template-driven content lifts hit rates above 60%.
Result. $14k/month → $2.1k/month. Latency dropped from 4.2s to 0.9s p95 — Haiku is faster, and cache hits are instant.
Use Case 4 — RAG Pipeline for Internal Knowledge Base
The setup. Employees query an internal KB ("how do I expense international travel?"). Each query embeds + retrieves + generates with Sonnet.
The fix. The semantic cache layer is decisive here because corporate KB queries are dominated by a few hundred high-frequency questions. With a 30-day TTL and threshold 0.94, hit rate climbs to 58% within a quarter. Routing sends "what does X mean" lookups to Haiku and reserves Sonnet for "compare policy A and B for situation C."
Result. Cost down 71%. The bigger win: latency on cache hits drops from 1,800ms to 80ms — the KB now feels like search, not a chatbot.
Use Case 5 — Agent Loops That Were Bankrupting You
The setup. Your LangGraph agent retries on validation failures. One bad input puts it in a 40-iteration loop at $0.30 per iteration.
The fix. The budget guard is the unsung hero of agent reliability. Per-trace spend ceilings ($1 max per agent run) hard-stop runaways. Per-user daily limits ($5/day) prevent abuse. The router still works inside the agent — the planner uses Sonnet, the worker tools use Haiku, and only the synthesizer touches Opus.
Result. Eliminated all five "surprise $400 nights" the team had hit in the previous quarter. Average agent run cost dropped from $0.62 to $0.11.
The Decision Framework — When to Build vs. Buy
You have three options for getting an LLM gateway in production:
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Roll your own (this guide) | Total control · pick your stack · no per-request fees | 1-2 weeks of engineering · ongoing maintenance | You have a strong infra team and high volume |
| Open-source (LiteLLM, Portkey OSS, Helicone) | Routing + observability free · battle-tested | Caching often DIY · self-host or trust third party | Most teams — fastest path to 80% of the value |
| Hosted gateway (OpenRouter, Portkey Cloud, Cloudflare AI Gateway) | Zero ops · multi-provider · usage analytics built in | Per-request markup · data leaves your VPC | Pre-revenue or compliance-flexible products |
Practical recommendation: start with LiteLLM for routing + a small custom semantic-cache layer in front of it. This gets you 70% of the savings in a week. Build the rest of the stack only after you have metrics showing where the next 20% lives.
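A minimal sketch of that starting point, reusing the Pattern 1 classifier and Pattern 2 cache in front of LiteLLM's async completion call (litellm.acompletion returns the OpenAI response shape regardless of provider; model names come from the TIERS map and are illustrative):

import litellm

async def gateway_complete(pool, query: str, system: str) -> str:
    if cached := await cache_lookup(pool, query):   # Pattern 2 layer first
        return cached
    tier = classify(query)                          # Pattern 1 classifier
    response = await litellm.acompletion(
        model=TIERS[tier],                          # LiteLLM normalizes providers
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ],
    )
    text = response.choices[0].message.content
    await cache_store(pool, query, text)
    return text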
When You Should NOT Build a Gateway
The pattern is not free. Skip it (for now) when:
- Your daily LLM spend is < $50. The engineering and operational cost will exceed the savings. Revisit at $200/day.
- Every query genuinely needs flagship reasoning — e.g., legal contract analysis, complex code generation. Routing has nothing to optimize.
- Strict deterministic requirements (medical, financial advice). Semantic caching can return a near-match that's slightly wrong; that's unacceptable here. Use exact-match caching only.
- You are still iterating on the core prompt. Add the gateway after the prompt stabilizes, not before — otherwise cached responses become technical debt every time you change the system prompt.
What's Next
The LLM gateway is the most boring, highest-ROI piece of infrastructure you can build for an AI-backed product. Three trends are pushing it from "advanced pattern" to "default architecture" through 2026:
- Speculative routing. The gateway sends a query to Tier 1 and Tier 2 in parallel, returns Tier 1 if its validator passes within 200ms, otherwise switches to Tier 2's stream. Cuts p95 latency without giving up cost savings. (A sketch follows this list.)
- Embedded distillation feedback loops. Every escalation from Tier 1 → Tier 2 is logged as training data. Periodic fine-tunes of the small model close the gap; tier mix shifts down on its own.
- Provider-aware routing. Real-time pricing, latency, and outage signals feed the router. When OpenAI raises prices or Anthropic has a regional incident, traffic reroutes within seconds — invisibly to your application.
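A sketch of speculative routing under the assumptions in that first bullet, reusing acomplete, validator_passes, and TIERS from earlier; the 200 ms budget is the knob to tune:

import asyncio

async def speculative_complete(query: str, system: str) -> str:
    t1 = asyncio.create_task(acomplete(TIERS["simple"], system, query))
    t2 = asyncio.create_task(acomplete(TIERS["medium"], system, query))
    try:
        # shield() keeps t1 alive even if the 200 ms wait times out
        r1 = await asyncio.wait_for(asyncio.shield(t1), timeout=0.2)
        if validator_passes(query, r1.text, model=TIERS["simple"]):
            t2.cancel()        # cheap answer validated: drop the hedge request
            return r1.text
    except asyncio.TimeoutError:
        pass                   # Tier 1 too slow; let the hedge win
    r2 = await t2
    if not t1.done():
        t1.cancel()
    return r2.text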
If you read this far, you have the patterns and the code. The gateway is a weekend project that will pay for itself by next Friday's invoice. Ship it.
References and Further Reading
- LiteLLM — Unified LLM Gateway — Open-source proxy supporting 100+ LLM providers with built-in routing, caching, and budget controls.
- Portkey AI Gateway — Production-grade gateway with semantic caching, fallback chains, and unified observability.
- Helicone — Observability for LLM Apps — Open-source LLM observability platform covering cost, latency, and prompt experiments.
- Cloudflare AI Gateway — Zero-ops gateway at the edge with caching, rate limits, and analytics.
- OpenRouter — Provider-agnostic LLM routing with real-time price and availability signals.
- pgvector — Open-source Vector Search for Postgres — Adds the vector type and HNSW indexing used in the semantic cache implementation.
- Anthropic Pricing — Claude API — Current per-token pricing for Haiku, Sonnet, and Opus model tiers.
- OpenAI Embeddings Documentation — Reference for text-embedding-3-small used in the semantic cache pipeline.
- Tenacity — Retry Library for Python — Provides the exponential backoff used in the failover code path.
- LLM Cost Optimization Patterns — Microsoft Azure Architecture Center — Microsoft's reference patterns for LLM cost reduction in enterprise deployments.