The LLM Gateway Pattern: Cut Your AI Bill 80% Without Touching a Prompt
Most LLM apps send every request to the most expensive model and re-pay for every duplicate question. The LLM Gateway pattern fixes both — with smart routing, semantic caching, and budget guards. Here is the production architecture, with code.
You shipped your first LLM feature in January. Cost: $400 a month. By March it was $4,000. This week your CFO forwarded the bill with a single word: "explain."
You open the dashboard and the answer is obvious: every request — every "what's my account balance," every "summarize this email," every duplicate question already asked five times today — is being routed to the most capable, most expensive model in your stack. You are paying GPT-5 Pro prices to answer "what time is it."
This is the most common production failure in AI engineering today, and it has nothing to do with prompts, models, or fine-tuning. It is an architecture problem. You are missing the LLM Gateway — a routing and caching layer that sits between your application and your model providers, and it is the single highest-ROI change you can make to a production AI system.
This guide shows you exactly how to build one. Real code. Real numbers. Real use cases you can ship this week.
Why Your AI Bill Is Exploding
Three structural problems compound to drive LLM costs up and to the right:
1. The Flagship-Tax Problem
Engineers default to the most capable model "to be safe." But analysis of production traffic from a dozen LLM-backed SaaS apps shows the same distribution every time: roughly 60% of queries are trivial (lookups, classifications, short rewrites), 30% are medium complexity (summarization, structured extraction, light reasoning), and only 10% genuinely need flagship reasoning (multi-step planning, novel synthesis, complex code).
Sending the trivial 60% to a flagship model is not "being safe." It is paying 18× more than necessary for output a small model would produce identically. Early in an organization's AI adoption the skew is usually even worse, which means most of the spend is simply unnecessary.
2. The Duplicate-Question Problem
Real users ask the same questions over and over. "How do I cancel my subscription?" "What's your return policy?" "Reset my password." Your customer-support copilot answered that exact question 4,200 times last week — and paid the LLM provider 4,200 times to compute the same answer. And even a simple question often arrives wrapped in pages of context, such as the full output of a failed command, which multiplies the token bill.
A naive string-match cache catches almost none of these because users phrase the same question differently. "reset password," "I forgot my password," "How do I get back into my account" — three strings, one question, three full LLM calls.
3. The Runaway-Cost Problem
A single misconfigured retry loop, a malicious user, or an agent stuck in a self-correction cycle can burn through a month's budget in twenty minutes. Without a budget guard, you find out when the invoice arrives. With one, the gateway returns 429 and pages you instead of bankrupting you.
The numbers, quantified
Across instrumented production deployments using the patterns in this guide, teams report 62-84% cost reduction with no measurable drop in user-visible quality. The largest single contributor is routing (40-55% savings), followed by semantic caching (15-30% savings on top of routing), with budget guards eliminating the long-tail spend spikes that drive most "surprise invoice" incidents.
The LLM Gateway Pattern
An LLM Gateway is a service — typically a thin FastAPI app, Cloudflare Worker, or Kong plugin — that intercepts every LLM request from your application before it reaches a model provider. It does five things:
- Cache — return a stored response if the new query is semantically equivalent to a recent one
- Classify — assess whether the query is simple, medium, or hard
- Route — pick the cheapest model that can answer it correctly
- Validate — cheaply check the response and escalate to a stronger model on low confidence
- Govern — enforce per-user budget limits, redact PII, log structured traces
Your application code never picks a model directly. It calls gateway.complete(query) and the gateway makes the decision. This single abstraction is what lets you change pricing, swap providers, A/B test models, and apply caching without touching feature code.
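What that looks like at a call site: a minimal sketch, assuming the FastAPI gateway built later in this guide is deployed at an internal URL (the URL is a placeholder; the request fields mirror the /v1/complete endpoint defined below).

import httpx

async def ask(query: str, user_id: str) -> str:
    # Feature code calls the gateway; it never names a model or provider.
    async with httpx.AsyncClient(timeout=30.0) as http:
        r = await http.post(
            "http://llm-gateway.internal/v1/complete",  # placeholder deploy URL
            json={"query": query, "user_id": user_id},
        )
        r.raise_for_status()
        return r.json()["text"]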
Pattern 1: The Routing Cascade
The routing cascade is the heart of the gateway. Try the cheapest model that could plausibly answer. Validate the answer cheaply. Escalate only if validation fails.
Building the Classifier
The classifier is the smartest cost lever in the entire gateway. Get it right and 60% of your traffic stops paying flagship prices. Get it wrong and you ship a worse product than before.
The trick: do not use the flagship model to classify. That defeats the point. Use heuristics first, then a small model second.
import re
from anthropic import Anthropic
client = Anthropic()
# Cheap, deterministic heuristics — run these first
SIMPLE_PATTERNS = [
r"^(what|when|where|who|how much|how many)\b.{0,80}\?$", # short factoid
r"^(yes|no|okay|thanks|thank you|cancel)\b", # acks
r"^/(help|status|reset|clear)\b", # commands
]
HARD_SIGNALS = [
"step by step", "plan", "strategy", "design", "architecture",
"compare and contrast", "analyze", "tradeoffs", "implications",
"write code", "implement", "refactor", "debug",
]
def fast_classify(query: str) -> str | None:
"""Return 'simple' / 'hard' / None (=unsure, ask the small model)."""
q = query.strip().lower()
if len(q) < 60 and any(re.search(p, q) for p in SIMPLE_PATTERNS):
return "simple"
if any(s in q for s in HARD_SIGNALS) or len(q) > 1500:
return "hard"
return None # let the small classifier decide
def llm_classify(query: str) -> str:
"""Tiny-model fallback when heuristics are uncertain."""
response = client.messages.create(
model="claude-haiku-4-5", # ~$1 per million tokens
max_tokens=10,
system=(
"Classify the user query into exactly one of: simple, medium, hard. "
"simple = lookup, ack, short rewrite. "
"medium = summarization, extraction, light reasoning. "
"hard = multi-step planning, novel synthesis, code generation. "
"Reply with only the single word."
),
messages=[{"role": "user", "content": query}],
)
label = response.content[0].text.strip().lower()
return label if label in {"simple", "medium", "hard"} else "medium"
def classify(query: str) -> str:
    return fast_classify(query) or llm_classify(query)

The heuristic layer answers ~70% of classification calls for free. The remaining 30% pay one Haiku call each — adding under a tenth of a cent per request to save dollars on the routing decision.
Wiring the Cascade
TIERS = {
"simple": "claude-haiku-4-5", # $0.80 / 1M input
"medium": "claude-sonnet-4-6", # $3.00 / 1M input
"hard": "claude-opus-4-7", # $15.00 / 1M input
}
ESCALATION = {"simple": "medium", "medium": "hard", "hard": None}
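# acomplete(model, system, query) is assumed throughout: your thin async wrapper
# around the provider SDK, returning an object exposing .text and .usage.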
async def route_and_complete(query: str, system: str) -> dict:
tier = classify(query)
attempts = []
while tier:
model = TIERS[tier]
response = await acomplete(model, system, query)
attempts.append({"tier": tier, "model": model, "tokens": response.usage})
if validator_passes(query, response.text, model=TIERS["simple"]):
return {"text": response.text, "attempts": attempts, "final_tier": tier}
tier = ESCALATION[tier]
# All tiers tried; return the best we got
return {"text": response.text, "attempts": attempts, "final_tier": "hard"}The Self-Check Validator
The validator is what makes the cascade safe. Without it, a small model that confidently produces a wrong answer goes uncaught. With it, the gateway can ship aggressive routing and trust the safety net.
def validator_passes(query: str, answer: str, model: str) -> bool:
"""LLM-as-judge — cheap confidence check, returns True/False."""
response = client.messages.create(
model=model,
max_tokens=20,
system=(
"You are a confidence checker. Read the user query and the proposed answer. "
"Reply with CONFIDENT if the answer fully and correctly addresses the query. "
"Reply with UNSURE if the answer is incomplete, off-topic, hedged, or might be wrong."
),
messages=[{
"role": "user",
"content": f"Query:\n{query}\n\nProposed answer:\n{answer}"
}],
)
    return response.content[0].text.strip().upper().startswith("CONFIDENT")

Tune the validator on your traffic
Run the validator in shadow mode first — log its verdicts for a week without acting on them. Compare against human review on a sample. Adjust the system prompt until false-pass rate is under 3% on your traffic. Only then enable escalation. Without this calibration step, the validator either over-escalates (defeating the cost savings) or under-escalates (shipping wrong answers).
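A minimal shadow-mode sketch, assuming the emit_metric helper used elsewhere in this guide: it records the verdict alongside live traffic without acting on it.

async def validator_shadow(query: str, answer: str, model: str) -> None:
    """Calibration mode: log the verdict, never escalate on it."""
    verdict = validator_passes(query, answer, model=model)
    await emit_metric(
        "gateway.validator.shadow",
        passed=verdict,
        query_len=len(query),
        answer_len=len(answer),
    )

Sample these logs against human review and tune the validator prompt until the false-pass rate clears your bar before flipping escalation on.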
Pattern 2: Semantic Caching
Exact-string caching catches almost nothing in production because humans phrase the same question a dozen ways. Semantic caching matches by meaning. The query is embedded; the gateway searches a vector index of recent queries; if any cached query is "close enough" by cosine similarity, the cached response is returned.
A Working Implementation with pgvector
import asyncpg
import numpy as np
from openai import AsyncOpenAI
oai = AsyncOpenAI()
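# NOTE: the asyncpg pool used below assumes the pgvector codec is registered on
# each connection (e.g. via pgvector.asyncpg.register_vector) so that Python
# lists of floats round-trip as the Postgres vector type.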
async def embed(text: str) -> list[float]:
r = await oai.embeddings.create(
model="text-embedding-3-small", # $0.02 / 1M tokens
input=text,
)
return r.data[0].embedding
async def cache_lookup(pool, query: str, threshold: float = 0.92) -> str | None:
vec = await embed(query)
# pgvector cosine distance: 0 = identical, 2 = opposite
async with pool.acquire() as conn:
row = await conn.fetchrow(
"""
SELECT response, 1 - (embedding <=> $1) AS similarity
FROM llm_cache
WHERE expires_at > NOW()
ORDER BY embedding <=> $1
LIMIT 1
""",
vec,
)
if row and row["similarity"] >= threshold:
return row["response"]
return None
async def cache_store(pool, query: str, response: str, ttl_seconds: int = 86400):
vec = await embed(query)
async with pool.acquire() as conn:
await conn.execute(
"""
INSERT INTO llm_cache (query, embedding, response, expires_at)
VALUES ($1, $2, $3, NOW() + ($4 || ' seconds')::interval)
""",
query, vec, response, str(ttl_seconds),
        )

Schema:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE llm_cache (
id BIGSERIAL PRIMARY KEY,
query TEXT NOT NULL,
embedding vector(1536) NOT NULL,
response TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX llm_cache_embedding_idx
ON llm_cache USING hnsw (embedding vector_cosine_ops);
CREATE INDEX llm_cache_expires_idx ON llm_cache (expires_at);

Threshold and TTL — the Two Knobs That Matter
The similarity threshold is the precision-recall tradeoff of caching:
- 0.97-0.99 — only near-identical paraphrases hit. Safe for legal, medical, financial answers where any drift is unacceptable. Hit rates: 10-20%.
- 0.92-0.95 — sweet spot for most product traffic. Catches paraphrases without conflating different questions. Hit rates: 35-50%.
- 0.85-0.90 — aggressive. Use for low-stakes content (marketing, summaries, rewrites). Hit rates: 50-70%, but expect occasional "close but wrong" matches.
TTL by content type (a combined policy sketch follows the table):
| Content | TTL | Why |
|---|---|---|
| Static FAQ ("what is your refund policy") | 30 days | Policy rarely changes |
| Product info | 24 hours | Catalog updates daily |
| User-specific data | Do not cache | Cross-user leakage risk |
| Time-sensitive ("latest news") | 5 minutes or skip | Stale answers are wrong |
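Both knobs collapse into one policy table the gateway can consult per request. A minimal sketch, with values taken from the guidance above; classify_content is an assumed helper (your own content tagger):

# (threshold, ttl_seconds) per content category; None means "do not cache".
CACHE_POLICY: dict[str, tuple[float, int] | None] = {
    "static_faq":     (0.94, 30 * 86400),  # policy rarely changes
    "product_info":   (0.92, 86400),       # catalog updates daily
    "user_specific":  None,                # cross-user leakage risk
    "time_sensitive": (0.97, 300),         # stale answers are wrong
    "low_stakes":     (0.88, 86400),       # marketing, summaries, rewrites
}

def cache_params(query: str) -> tuple[float, int] | None:
    category = classify_content(query)  # assumed helper: your content tagger
    return CACHE_POLICY.get(category)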
Never cache user-scoped content in a shared cache
If your cache is shared across users, you must either (a) include the user ID in the cache key/embedding, or (b) skip caching for any query that could return user-specific data. The single most expensive bug in semantic caching is showing User A the cached answer that was generated for User B. Use a per-user cache namespace or a query allowlist.
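A minimal sketch of option (a), assuming a user_id column added to the llm_cache table; shared answers are stored under a sentinel scope so FAQ hits still work across users:

async def scoped_cache_lookup(pool, user_id: str, query: str,
                              threshold: float = 0.92) -> str | None:
    vec = await embed(query)
    async with pool.acquire() as conn:
        row = await conn.fetchrow(
            """
            SELECT response, 1 - (embedding <=> $1) AS similarity
            FROM llm_cache
            WHERE expires_at > NOW()
              AND user_id IN ($2, '__global__')  -- own rows + shared rows only
            ORDER BY embedding <=> $1
            LIMIT 1
            """,
            vec, user_id,
        )
    if row and row["similarity"] >= threshold:
        return row["response"]
    return None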
Pattern 3: Budget Guards and Failover
Cost optimization without spend ceilings is a coin flip. The next runaway loop, prompt injection, or pricing change can erase a quarter of savings in a day. The gateway needs hard limits.
from datetime import datetime
from decimal import Decimal
async def check_budget(pool, user_id: str, estimated_cost: Decimal) -> bool:
"""Return True if the request fits the user's daily/monthly budget."""
async with pool.acquire() as conn:
spent_today, spent_month = await conn.fetchrow(
"""
SELECT
COALESCE(SUM(cost) FILTER (WHERE day = CURRENT_DATE), 0) AS today,
COALESCE(SUM(cost) FILTER (WHERE day >= date_trunc('month', CURRENT_DATE)), 0) AS month
FROM llm_spend WHERE user_id = $1
""",
user_id,
)
daily_limit = Decimal("5.00")
monthly_limit = Decimal("100.00")
return (
spent_today + estimated_cost <= daily_limit
and spent_month + estimated_cost <= monthly_limit
)
async def record_spend(pool, user_id: str, model: str, cost: Decimal):
async with pool.acquire() as conn:
await conn.execute(
"INSERT INTO llm_spend (user_id, day, model, cost) VALUES ($1, CURRENT_DATE, $2, $3)",
user_id, model, cost,
        )

Failover is the other half of governance. When the primary provider is down or slow, the gateway routes to a secondary — typically the same tier from a different vendor.
from tenacity import retry, stop_after_attempt, wait_exponential

class ProviderError(Exception):
    """Stand-in for your provider SDK's API-error base class."""
PROVIDER_FALLBACKS = {
"claude-haiku-4-5": ["gpt-5-mini", "gemma-4-9b-instruct"],
"claude-sonnet-4-6": ["gpt-5", "gemini-2-5-flash"],
"claude-opus-4-7": ["gpt-5-pro", "gemini-2-5-pro"],
}
@retry(stop=stop_after_attempt(2), wait=wait_exponential(min=0.2, max=2))
async def acomplete_with_failover(model: str, system: str, query: str):
try:
return await acomplete(model, system, query, timeout=8.0)
except (TimeoutError, ProviderError) as e:
for fallback in PROVIDER_FALLBACKS.get(model, []):
try:
return await acomplete(fallback, system, query, timeout=8.0)
except (TimeoutError, ProviderError):
continue
    raise e

The Production Architecture
Putting it all together, this is what a hardened LLM Gateway looks like in production.
The FastAPI Skeleton
Here is the entire request path, in 60 lines, ready to extend:
from fastapi import FastAPI, HTTPException, Request
from pydantic import BaseModel
from decimal import Decimal
import asyncpg
import time
app = FastAPI()
class CompleteRequest(BaseModel):
query: str
system: str | None = None
user_id: str
cache_ttl: int = 86400
cache_threshold: float = 0.92
@app.post("/v1/complete")
async def complete(req: CompleteRequest, request: Request):
pool = request.app.state.pool
started = time.perf_counter()
# 1. Budget guard (estimate ~2k tokens worst case)
if not await check_budget(pool, req.user_id, Decimal("0.05")):
raise HTTPException(429, "Daily budget exceeded")
# 2. Semantic cache lookup
if cached := await cache_lookup(pool, req.query, req.cache_threshold):
await emit_metric("gateway.cache.hit", user=req.user_id)
return {"text": cached, "cached": True, "latency_ms": ms(started)}
# 3. Classify + route + validate (cascade)
result = await route_and_complete(req.query, req.system or "")
# 4. Persist + record spend + emit traces
await cache_store(pool, req.query, result["text"], ttl_seconds=req.cache_ttl)
cost = compute_cost(result["attempts"])
    await record_spend(pool, req.user_id, TIERS[result["final_tier"]], cost)  # bill against the model actually used
await emit_trace(req, result, cost, ms(started))
return {
"text": result["text"],
"cached": False,
"tier": result["final_tier"],
"cost_usd": float(cost),
"latency_ms": ms(started),
}
def ms(t0: float) -> int:
    return int((time.perf_counter() - t0) * 1000)

Observability — the Metrics That Matter
Without metrics, you cannot tune the gateway. With them, you find the next 20% of savings every week. Track these at minimum:
| Metric | What it tells you | Action threshold |
|---|---|---|
| gateway.cache.hit_rate | Is semantic caching working? | < 25% → loosen threshold or check write-back |
| gateway.tier.distribution | Where is your spend going? | > 30% on Tier 3 → classifier is over-escalating |
| gateway.escalation.rate | How often is the cascade falling through? | > 15% → small model is not strong enough for your traffic |
| gateway.cost_per_request | The headline number | Track per route + per user segment |
| gateway.p95_latency_ms | Cascade adds latency on misses | > 2× baseline → consider parallel speculation |
| gateway.budget.rejection_rate | Are users hitting limits? | > 1% → review per-user limits |
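One of these, gateway.cost_per_request, depends on the compute_cost helper the skeleton calls but leaves undefined. A minimal sketch, assuming the attempts structure from the cascade and an Anthropic-style usage object with input_tokens/output_tokens fields; the prices are illustrative and should be synced with current provider pricing:

from decimal import Decimal

# Illustrative (input, output) prices per 1M tokens; keep in sync with provider pricing.
PRICES = {
    "claude-haiku-4-5":  (Decimal("0.80"),  Decimal("4.00")),
    "claude-sonnet-4-6": (Decimal("3.00"),  Decimal("15.00")),
    "claude-opus-4-7":   (Decimal("15.00"), Decimal("75.00")),
}

def compute_cost(attempts: list[dict]) -> Decimal:
    """Sum the cost of every cascade attempt, not just the final one."""
    total = Decimal("0")
    for a in attempts:
        in_price, out_price = PRICES[a["model"]]
        usage = a["tokens"]  # usage object captured by route_and_complete
        total += (Decimal(usage.input_tokens) * in_price
                  + Decimal(usage.output_tokens) * out_price) / Decimal(1_000_000)
    return total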
Five Real Use Cases You Can Ship This Week
The patterns above are general. These are the use cases where they pay off fastest.
Use Case 1 — Customer Support Copilot
The setup. A SaaS support widget answers user questions using your help docs as context. You were sending every query to GPT-5 Pro because some questions need careful reasoning.
The fix. Heuristic classifier sends "what's your refund policy" / "how do I reset my password" to Haiku with the relevant doc snippet. Sonnet handles "I'm seeing X error after step Y, what's wrong." Opus reserved for multi-issue tickets and angry-customer escalations.
Numbers from a real deployment (12k queries/day, mid-market SaaS): cost per query dropped from $0.018 to $0.0031 — 83% reduction. Cache hit rate on FAQ-style queries hit 47% within two weeks.
Use Case 2 — Code Review Bot for a Monorepo
The setup. A GitHub Action runs Opus on every PR diff. Costs $8,000/month and growing.
The fix. Cascade by diff size and file type. Diffs < 200 lines or touching only tests/docs go to Sonnet. Diffs touching critical paths (auth, billing, schemas) go to Opus regardless of size. Validator escalates if Sonnet flags any "I'm not sure" reasoning. Semantic cache keyed on (filename + diff) catches re-runs of the same PR.
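A sketch of that routing rule; the path patterns are illustrative, and your critical-path list will differ:

CRITICAL_PATHS = ("auth/", "billing/", "migrations/", "schema")  # illustrative

def review_tier(diff_line_count: int, changed_files: list[str]) -> str:
    # Critical paths always get the flagship, regardless of diff size.
    if any(p in f for f in changed_files for p in CRITICAL_PATHS):
        return "hard"
    # Tests/docs-only changes or small diffs go to the mid tier.
    if diff_line_count < 200 or all(
        f.startswith(("tests/", "docs/")) for f in changed_files
    ):
        return "medium"
    return "hard"  # large, non-critical diffs: default up, not down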
Result. Same precision/recall on a 200-PR holdout set. $8,000/month → $1,650/month. The validator catches the ~6% of diffs where Sonnet would have missed something subtle.
Use Case 3 — Bulk Content Generation Pipeline
The setup. Marketing platform generates product descriptions, social posts, and email subject lines. 50,000 generations/day on GPT-5.
The fix. This is where small models eat the flagships' lunch. Haiku 4.5 produces product descriptions that A/B-test identically to GPT-5 output for 90% of categories. Opus is reserved for hero copy and brand-sensitive launches. Aggressive semantic caching (threshold 0.88) on template-driven content lifts hit rates above 60%.
Result. $14k/month → $2.1k/month. Latency dropped from 4.2s to 0.9s p95 — Haiku is faster, and cache hits are instant.
Use Case 4 — RAG Pipeline for Internal Knowledge Base
The setup. Employees query an internal KB ("how do I expense international travel?"). Each query embeds + retrieves + generates with Sonnet.
The fix. The semantic cache layer is decisive here because corporate KB queries are dominated by a few hundred high-frequency questions. With a 30-day TTL and threshold 0.94, hit rate climbs to 58% within a quarter. Routing sends "what does X mean" lookups to Haiku and reserves Sonnet for "compare policy A and B for situation C."
Result. Cost down 71%. The bigger win: latency on cache hits drops from 1,800ms to 80ms — the KB now feels like search, not a chatbot.
Use Case 5 — Agent Loops That Were Bankrupting You
The setup. Your LangGraph agent retries on validation failures. One bad input puts it in a 40-iteration loop at $0.30 per iteration.
The fix. The budget guard is the unsung hero of agent reliability. Per-trace spend ceilings ($1 max per agent run) hard-stop runaways. Per-user daily limits ($5/day) prevent abuse. The router still works inside the agent — the planner uses Sonnet, the worker tools use Haiku, and only the synthesizer touches Opus.
Result. Eliminated all five "surprise $400 nights" the team had hit in the previous quarter. Average agent run cost dropped from $0.62 to $0.11.
The Decision Framework — When to Build vs. Buy
You have three options for getting an LLM gateway in production:
| Option | Pros | Cons | When to choose |
|---|---|---|---|
| Roll your own (this guide) | Total control · pick your stack · no per-request fees | 1-2 weeks of engineering · ongoing maintenance | You have a strong infra team and high volume |
| Open-source (LiteLLM, Portkey OSS, Helicone) | Routing + observability free · battle-tested | Caching often DIY · self-host or trust third party | Most teams — fastest path to 80% of the value |
| Hosted gateway (OpenRouter, Portkey Cloud, Cloudflare AI Gateway) | Zero ops · multi-provider · usage analytics built in | Per-request markup · data leaves your VPC | Pre-revenue or compliance-flexible products |
Practical recommendation: start with LiteLLM for routing + a small custom semantic-cache layer in front of it. This gets you 70% of the savings in a week. Build the rest of the stack only after you have metrics showing where the next 20% lives.
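A minimal sketch of that starting point, reusing the Pattern 1 classifier and Pattern 2 cache in front of LiteLLM's async completion call (litellm.acompletion returns the OpenAI response shape regardless of provider; model names come from the TIERS map and are illustrative):

import litellm

async def gateway_complete(pool, query: str, system: str) -> str:
    if cached := await cache_lookup(pool, query):   # Pattern 2 layer first
        return cached
    tier = classify(query)                          # Pattern 1 classifier
    response = await litellm.acompletion(
        model=TIERS[tier],                          # LiteLLM normalizes providers
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ],
    )
    text = response.choices[0].message.content
    await cache_store(pool, query, text)
    return text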
When You Should NOT Build a Gateway
The pattern is not free. Skip it (for now) when:
- Your daily LLM spend is < $50. The engineering and operational cost will exceed the savings. Revisit at $200/day.
- Every query genuinely needs flagship reasoning — e.g., legal contract analysis, complex code generation. Routing has nothing to optimize.
- Strict deterministic requirements (medical, financial advice). Semantic caching can return a near-match that's slightly wrong; that's unacceptable here. Use exact-match caching only.
- You are still iterating on the core prompt. Add the gateway after the prompt stabilizes, not before — otherwise cached responses become technical debt every time you change the system prompt.
What's Next
The LLM gateway is the most boring, highest-ROI piece of infrastructure you can build for an AI-backed product. Three trends are pushing it from "advanced pattern" to "default architecture" through 2026:
- Speculative routing. The gateway sends a query to Tier 1 and Tier 2 in parallel, returns Tier 1 if its validator passes within 200ms, otherwise switches to Tier 2's stream. Cuts p95 latency without giving up cost savings. (A sketch follows this list.)
- Embedded distillation feedback loops. Every escalation from Tier 1 → Tier 2 is logged as training data. Periodic fine-tunes of the small model close the gap; tier mix shifts down on its own.
- Provider-aware routing. Real-time pricing, latency, and outage signals feed the router. When OpenAI raises prices or Anthropic has a regional incident, traffic reroutes within seconds — invisibly to your application.
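A sketch of speculative routing under the assumptions in that first bullet, reusing acomplete, validator_passes, and TIERS from earlier; the 200 ms budget is the knob to tune:

import asyncio

async def speculative_complete(query: str, system: str) -> str:
    t1 = asyncio.create_task(acomplete(TIERS["simple"], system, query))
    t2 = asyncio.create_task(acomplete(TIERS["medium"], system, query))
    try:
        # shield() keeps t1 alive even if the 200 ms wait times out
        r1 = await asyncio.wait_for(asyncio.shield(t1), timeout=0.2)
        if validator_passes(query, r1.text, model=TIERS["simple"]):
            t2.cancel()        # cheap answer validated: drop the hedge request
            return r1.text
    except asyncio.TimeoutError:
        pass                   # Tier 1 too slow; let the hedge win
    r2 = await t2
    if not t1.done():
        t1.cancel()
    return r2.text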
If you read this far, you have the patterns and the code. The gateway is a weekend project that will pay for itself by next Friday's invoice. Ship it.
References and Further Reading
- LiteLLM — Unified LLM Gateway — Open-source proxy supporting 100+ LLM providers with built-in routing, caching, and budget controls.
- Portkey AI Gateway — Production-grade gateway with semantic caching, fallback chains, and unified observability.
- Helicone — Observability for LLM Apps — Open-source LLM observability platform covering cost, latency, and prompt experiments.
- Cloudflare AI Gateway — Zero-ops gateway at the edge with caching, rate limits, and analytics.
- OpenRouter — Provider-agnostic LLM routing with real-time price and availability signals.
- pgvector — Open-source Vector Search for Postgres — Adds the vector type and HNSW indexing used in the semantic cache implementation.
- Anthropic Pricing — Claude API — Current per-token pricing for Haiku, Sonnet, and Opus model tiers.
- OpenAI Embeddings Documentation — Reference for text-embedding-3-small used in the semantic cache pipeline.
- Tenacity — Retry Library for Python — Provides the exponential backoff used in the failover code path.
- LLM Cost Optimization Patterns — Microsoft Azure Architecture Center — Microsoft's reference patterns for LLM cost reduction in enterprise deployments.