The KV Cache Is a Database. You're Probably Treating It Like a Buffer.
Every modern inference engine has a database hiding inside it — a content-addressable, multi-tenant, hierarchically-tiered store with eviction policies, locality of reference, and a 100× cost gap between hits and misses. Most teams ignore it. The teams that pay attention to it run inference at a fraction of the cost.
Most production LLM bills have a line item nobody reads: HBM utilization. It is the largest hidden cost in inference, and it is not the model weights. It is the KV cache — the per-request, per-token, per-layer activation store that grows linearly with context length and dominates GPU memory after a few thousand tokens.
Most teams treat the KV cache like a buffer. Allocate it on request start. Free it on request end. Sometimes share a prefix, if the framework happens to support it. This is the equivalent of treating a database as a chunk of scratch RAM you free at the end of each query — and it leaves a 5-10× cost reduction on the table for any team running non-trivial inference traffic.
This article makes a single argument: the KV cache is a database. It is content-addressable, hierarchically tiered, multi-tenant, and has a hit-rate curve as steep as any production cache you have ever tuned. The systems-engineering vocabulary that built Memcached, Redis, RocksDB, and modern columnar stores transfers directly. Once you start treating the KV cache like a real storage system — with eviction policies, replication, tiering, and a measurable hit rate — the production wins are immediate and large.
We will derive why every decoder forward pass is fundamentally a cache lookup, walk through the storage hierarchy that modern engines (vLLM, SGLang, TensorRT-LLM, Mooncake, Dynamo) have already built whether or not you have noticed, and end with the small set of operational metrics that decide whether your inference stack is running at 2× the cost it should be.
The Cache Lookup at the Heart of Every Token
A transformer decoder, for each output token, computes attention over every previous token in the context. Naively, this requires re-running the keys and values for the entire prefix on every step — quadratic work that no production system performs. Instead, on the first forward pass over a prompt (the prefill), the engine computes the keys K and values V for every prefix position once, and stores them. Every subsequent decode step only computes K and V for the new token, appends them to the stored tensors, and performs attention against the cached K and V.
The stored tensors are the KV cache. They are, mathematically, the memo of every conditional computation the model has already done on this prefix. For a given model they depend only on the prefix tokens, not on which user is asking, which application is running, or which day it is. If two requests share a prefix, they share the corresponding KV cache entries. This is the entire premise of prefix caching, and it is also the entire premise of every database that has ever been written: identical inputs should not be recomputed.
The math on KV cache size is unkind. For a 70B-parameter Llama-style model with 64 layers, 8 KV heads (grouped-query attention), head dimension 128, and FP16 storage:
bytes_per_token = 2 (K and V) × n_layers × n_kv_heads × head_dim × 2 (FP16)
= 2 × 64 × 8 × 128 × 2
= 262,144 bytes ≈ 256 KB per token
At 32K context, a single request occupies ~8 GB of KV cache. On an H200 with 141 GB of HBM, after subtracting ~70 GB for FP8 weights, you have room for about eight concurrent 32K requests. At 128K context per request, you have room for two. The KV cache, not the model weights, is what limits concurrency on modern serving hardware. This is the central fact that makes it a storage system rather than a buffer.
The number that decides whether you have a KV cache problem
Compute (HBM available − weights_size) / (kv_bytes_per_token × P95 context length). That is your hard concurrency ceiling per replica. If your traffic exceeds it, requests queue, latency spikes, and you scale by adding GPUs you would not need if your KV cache layer were doing its job.
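As a sanity check, here is that ceiling as a short Python sketch, using the 70B-class figures from above. Every hardware number is the illustrative one quoted earlier, not a measurement:

```python
# Back-of-envelope KV concurrency ceiling, using the 70B-class numbers
# quoted above. All hardware figures are illustrative, not measured.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # 2 tensors (K and V), per layer, per KV head, per head dim
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def concurrency_ceiling(hbm_bytes, weights_bytes, kv_per_token, p95_context):
    kv_budget = hbm_bytes - weights_bytes
    return int(kv_budget // (kv_per_token * p95_context))

GiB = 1024 ** 3
per_token = kv_bytes_per_token(64, 8, 128, 2)   # FP16 KV storage
print(per_token)                                 # 262144 bytes = 256 KB

# H200: 141 GB HBM, ~70 GB FP8 weights, P95 context of 32K tokens
print(concurrency_ceiling(141 * GiB, 70 * GiB, per_token, 32 * 1024))  # 8
```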
The Storage Hierarchy You Already Have
Pull up the architecture of any modern inference engine and you find — whether the developers framed it this way or not — a multi-level storage hierarchy that looks identical to the memory hierarchy of any operating system. The tiers in 2026 are:
- L0 — Active SRAM tensors during a single forward pass. Bandwidth: ~20 TB/s on H200. Capacity: ~256 MB. Lifetime: microseconds. You do not manage this directly; the kernel scheduler does.
- L1 — Working KV cache in HBM. Bandwidth: ~4.8 TB/s. Capacity: tens of GB per GPU. This is where active request KV lives.
- L2 — Offloaded KV in CPU DRAM. Bandwidth: ~50 GB/s over PCIe Gen5, ~900 GB/s over NVLink/Grace. Capacity: hundreds of GB per node. Where paused or low-priority request KV gets demoted.
- L3 — Persistent KV on local NVMe. Bandwidth: ~14 GB/s (PCIe Gen5 SSD). Capacity: terabytes. Where long-lived prefixes (system prompts, retrieved documents) get parked for warm restart.
- L4 — Distributed/disaggregated KV across a cluster. Bandwidth: limited by the data-center network (200-800 Gbps with InfiniBand or RoCE). Capacity: petabytes if you want it. This is where Mooncake and NVIDIA Dynamo live.
The cost gap across these tiers is enormous and follows familiar ratios: an L1 (HBM) KV hit costs roughly 100 ns of latency. An L2 (CPU) hit costs 10-100 μs. An L3 (NVMe) hit costs 100 μs - 1 ms. An L4 (remote) hit costs 1-10 ms plus network jitter. Compared to recomputing the prefix from scratch — which for a 32K prompt on a 70B model costs roughly 2 seconds of GPU time and dominates first-token latency — every tier of cache hit is a massive win.
The right mental model: this is not "GPU memory and a fallback." This is a five-tier storage hierarchy with cost ratios of roughly 1 : 10 : 1000 : 10000 : 100000 between L0 and L4. Your job as a systems engineer is the same job DBAs have done for forty years — keep the working set in the fastest tier that can hold it, and pay attention to which keys deserve to be promoted or demoted.
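To make the hierarchy concrete, here is a toy lookup model: check each tier in order, pay its latency on a hit, promote toward HBM, and fall back to recomputation on a full miss. Tier names and latencies are the rough figures from above, not any real engine's API:

```python
# Toy model of the five-tier KV hierarchy: look up a block, pay the
# latency of the tier that holds it, promote on hit. Latencies are the
# rough figures quoted above; all names are invented for illustration.

TIER_LATENCY = {        # seconds per block hit (illustrative)
    "L1_hbm":    100e-9,
    "L2_dram":    50e-6,
    "L3_nvme":   500e-6,
    "L4_remote":   5e-3,
}
TIER_ORDER = ["L1_hbm", "L2_dram", "L3_nvme", "L4_remote"]
RECOMPUTE = 2.0          # ~2 s to re-prefill a 32K prompt on a 70B model

def lookup(block_key, tiers):
    """tiers: dict tier_name -> set of cached block keys."""
    for name in TIER_ORDER:
        if block_key in tiers[name]:
            if name != "L1_hbm":                # promote toward the fast
                tiers["L1_hbm"].add(block_key)  # tier, like any inclusive cache
            return TIER_LATENCY[name], name
    return RECOMPUTE, "miss"                    # full prefill: the 100x penalty

tiers = {t: set() for t in TIER_ORDER}
tiers["L3_nvme"].add("sys-prompt-blk-0")
print(lookup("sys-prompt-blk-0", tiers))  # (0.0005, 'L3_nvme'), then promoted
```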
Paged Attention Is Virtual Memory. Literally.
The single most important systems contribution in LLM serving in the last three years was Kwon et al.'s PagedAttention (vLLM, SOSP 2023). The framing in the paper is technical, but the analogy is exact: paged attention is the virtual memory subsystem of LLM serving.
In a naive serving system, you allocate a contiguous KV cache slab of size max_context_length × bytes_per_token per request. If the request only uses 200 tokens of a 32K-token allocation, the remaining ~99.4% of that slab is wasted. Total HBM utilization in vLLM's pre-paged-attention measurements was 20-40%. The other 60-80% was reserved-but-unused fragmentation.
PagedAttention applies a textbook OS virtual-memory solution: chop the KV cache into fixed-size blocks (typically 16 tokens per block, easily varied), maintain a per-request block table mapping logical positions to physical blocks, and allocate blocks on demand as the request generates tokens. Sharing prefixes becomes trivial because two block tables can point to the same physical block — copy-on-write semantics included.
Three consequences follow:
1. Fragmentation drops from ~50% to under 5%. HBM utilization in published vLLM benchmarks goes from ~35% to ~94%. Concurrency rises in proportion. You did not buy more GPUs; you just stopped wasting the memory on the ones you have.
2. Prefix sharing becomes free. Two requests sending the same 10K-token system prompt share 625 physical blocks. The second request pays for zero KV cache for the shared prefix. This is why prefix caching is so cheap once paged attention exists — the bookkeeping was already paid for.
3. The block size is a tuning knob. Small blocks (8-16 tokens) maximize sharing but increase block-table overhead. Large blocks (64-256 tokens) reduce bookkeeping but waste memory on partially-filled tail blocks. The default of 16 is a reasonable starting point; production teams with heavy prefix sharing sometimes drop to 8.
If you have ever maintained a file system or a memory allocator, this is your home turf. The page table is the page table. The dirty-block tracking is dirty-block tracking. Treat the paged-attention layer like the kernel virtual-memory subsystem it is, and tune accordingly.
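To make the bookkeeping concrete, here is a minimal sketch of the block-table mechanics: demand allocation, refcounted sharing, and the copy-on-write premise, assuming 16-token blocks. All names are invented for illustration; real engines manage GPU tensors and fused kernels, not Python lists:

```python
# Minimal PagedAttention-style bookkeeping: fixed-size blocks, a
# per-request block table, demand allocation, refcounted prefix sharing.

BLOCK_SIZE = 16  # tokens per physical block

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        blk = self.free.pop()          # demand allocation: one block,
        self.refcount[blk] = 1         # only when a request needs it
        return blk

    def share(self, blk):
        self.refcount[blk] += 1        # two block tables, one physical
        return blk                     # block: this is prefix sharing

    def release(self, blk):
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:
            self.free.append(blk)      # block returns to the pool

class Request:
    def __init__(self, pool, shared_prefix_blocks=()):
        self.pool = pool
        # Block table: logical block index -> physical block id
        self.block_table = [pool.share(b) for b in shared_prefix_blocks]
        self.num_tokens = BLOCK_SIZE * len(self.block_table)

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:           # boundary crossed:
            self.block_table.append(self.pool.alloc())  # allocate on demand
        self.num_tokens += 1
```

The fragmentation win falls out directly: a request that stops at 200 tokens holds 13 blocks (208 tokens of capacity), not a 32K-token slab.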
Prefix Caching Is Content-Addressable Storage
Once blocks exist, the natural next question is: when a new request arrives with a prefix that overlaps a previously-completed request, can we reuse the blocks instead of recomputing them?
This is prefix caching, and in 2026 it is the single largest operational cost lever in production inference — exceeding even speculative decoding for many workloads. The mechanism is identical to a content-addressable store: hash the contents of each block (typically the cumulative hash of all preceding tokens), use the hash as a key, store the resulting physical block in an LRU pool. When a new request arrives, walk its prefix block-by-block, hashing as you go, and reuse any matching blocks from the pool.
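A minimal sketch of that mechanism, assuming 16-token blocks and SHA-256 over the cumulative token stream. Real engines use cheaper hash functions and GPU-side block metadata; this only shows the shape of the lookup:

```python
# Block-level prefix caching: hash each block's cumulative token prefix,
# reuse any physical block whose hash is already in the pool.

import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids):
    """Cumulative hash at each full block boundary."""
    hashes, h = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        for t in token_ids[i:i + BLOCK_SIZE]:
            h.update(t.to_bytes(4, "little"))
        hashes.append(h.copy().hexdigest())  # hash covers ALL tokens so far
    return hashes

def matched_prefix_tokens(token_ids, pool):
    """pool: dict hash -> physical block id. Returns reusable token count."""
    matched = 0
    for block_hash in block_hashes(token_ids):
        if block_hash not in pool:
            break
        matched += BLOCK_SIZE
    return matched

# Populate the pool from a finished request, then measure reuse:
pool = {h: i for i, h in enumerate(block_hashes(list(range(64))))}
print(matched_prefix_tokens(list(range(64)) + [999], pool))  # 64
```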
The two main implementations diverge interestingly:
vLLM's automatic prefix caching uses block-level hashing. Cumulative hashes are computed at every block boundary, and the hash table maps to physical blocks. Simple, fast, and works.
SGLang's RadixAttention (Zheng et al., 2024) goes further. Instead of a flat hash table, it maintains a radix tree over all cached prefixes, with each tree node holding the physical blocks for a token range. Lookups become tree walks; sharing becomes structural. The tree representation makes it trivial to find the longest matching prefix for a new request, even if no single previous request was identical.
Why does this matter? In real production traffic, prefixes overlap in fragmented ways. Two requests share the system prompt and the user's last three turns, but diverge on the current turn. A hash-based cache catches the system prompt; a radix-tree cache catches the system prompt and the partial conversation. Production benchmarks routinely show 15-30% better hit rates from radix-tree caching over flat hashing on multi-turn chat workloads.
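To see why the tree wins, here is a compact longest-prefix lookup in the spirit of RadixAttention. The node layout is invented for illustration, and insertion and eviction are elided:

```python
# Radix-tree prefix lookup: nodes own token ranges along their edges;
# the walk returns the longest cached prefix even when it spans edges.

class RadixNode:
    def __init__(self):
        self.children = {}   # first token id -> (edge_tokens, child node)
        self.blocks = []     # physical KV blocks covering this node's edge

def longest_prefix(root, tokens):
    """Walk edges, counting matched tokens; stop at first divergence."""
    node, remaining, count = root, tuple(tokens), 0
    while remaining:
        edge = node.children.get(remaining[0])
        if edge is None:
            break
        edge_tokens, child = edge
        n = 0
        for a, b in zip(edge_tokens, remaining):
            if a != b:
                break
            n += 1
        count += n
        if n < len(edge_tokens):   # diverged mid-edge: partial reuse
            break
        node, remaining = child, remaining[n:]
    return count

root, leaf = RadixNode(), RadixNode()
root.children[1] = ((1, 2, 3, 4), leaf)          # cached system prompt
leaf.children[5] = ((5, 6), RadixNode())         # one user's earlier turn
print(longest_prefix(root, (1, 2, 3, 9)))        # 3: partial edge match
print(longest_prefix(root, (1, 2, 3, 4, 5, 7)))  # 5: crosses into child
```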
Across our production deployments, the prefix cache hit rate dominates economics:
effective_cost_per_token =
miss_rate × full_prefill_cost
+ hit_rate × hash_lookup_cost
With a 70% hit rate on long system prompts, you pay 30% of the prefill cost you would have paid without caching. The compounding is severe — moving from a 30% hit rate to a 75% hit rate cuts the remaining prefill cost from 70% to 25%, nearly a 3× reduction. This is the cache curve that database engineers spent forty years climbing, and the same curve now governs your LLM bill.
The 60-token rule, and why your hit rate is probably lower than you think
Most engines require prefix matches to be at least one block long (16 tokens) to bother caching, and prompts shorter than ~60 tokens often miss caching entirely because of block-alignment edge cases. If your system prompt is on a different alignment than your retrieved chunks, you can have 0% hit rate on the chunks even when they repeat across requests. Audit your prompt templates: put high-reuse content (system prompt, persistent tool definitions) at the very top, block-aligned, and shorten anything between them and your user turn.
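A quick way to audit the effect, assuming 16-token blocks: only whole blocks of a shared prefix are reusable, so unaligned prefixes silently forfeit their tails:

```python
# How many tokens of a shared prefix are actually cache-eligible,
# assuming 16-token blocks. Illustrative of the alignment rule above.

BLOCK_SIZE = 16

def cacheable_tokens(shared_prefix_len):
    return (shared_prefix_len // BLOCK_SIZE) * BLOCK_SIZE

print(cacheable_tokens(59))    # 48: an unaligned 59-token prefix loses 11
print(cacheable_tokens(1024))  # 1024: block-aligned prefixes lose nothing
```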
Eviction: The Policy That Decides Your Hit Rate
Caches are interesting because they are bounded. The KV cache pool has finite physical blocks; when a new request needs a block and the pool is full, the engine must evict. The policy here directly controls your hit rate, and it is one of the least-discussed knobs in LLM serving.
The dominant policies in 2026:
LRU (Least Recently Used). vLLM's default. Simple, well-understood, but blind to access frequency. A system prompt used by 99% of requests gets evicted just as easily as a one-shot retrieved chunk.
LFU (Least Frequently Used). Used by some custom forks. Captures recurrence but can pin stale blocks indefinitely if they had a popularity spike days ago.
Prefix-aware eviction. SGLang's radix-tree cache naturally evicts leaf-first, which is correct: the leaves are the most recent, lowest-reuse continuations, while the root and inner nodes (shared system prompts) are protected. This policy is structurally prefix-aware in a way no flat LRU can be.
TTL-bounded promotion. Anthropic's prompt caching API exposes two tiers — a 5-minute ephemeral cache and a 1-hour long-lived cache — with the caller controlling which tier each block lives in. This is essentially the LLM equivalent of SET ... EX 300 in Redis. The application, not the engine, makes the policy decision, which is the right design for multi-tenant clouds.
Pinning. Production teams running with long-lived system prompts (chat-style agents, RAG with stable corpora) increasingly pin specific prefixes so they are never evicted. vLLM's --enable-prefix-caching plus block-level pinning gives you this; SGLang supports it via cache-aware request scheduling. Pinning the right blocks can push hit rates from ~50% to ~90% on workloads with stable system prompts.
The single best operational practice we have seen is measuring hit rate per logical prefix segment, not aggregate. Aggregate hit rate of 60% can hide the fact that your system prompt is hitting 95% (good) while your retrieved document chunks are hitting 5% (catastrophic). Disaggregated metrics tell you where to act.
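A sketch of that disaggregated measurement, with invented segment labels; in production these would be Prometheus counters keyed by a segment label rather than Python dicts:

```python
# Hit rate per logical prefix segment, not in aggregate.

from collections import defaultdict

hits = defaultdict(int)
lookups = defaultdict(int)

def record(segment, hit):
    lookups[segment] += 1
    hits[segment] += hit            # bool counts as 0 or 1

def hit_rates():
    return {s: hits[s] / lookups[s] for s in lookups}

record("system_prompt", True)
record("system_prompt", True)
record("rag_chunk", False)
print(hit_rates())  # a 100% system-prompt rate can hide a 0% rag_chunk rate
```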
Multi-Tenant Sharing and the Fair-Share Problem
The instant you serve more than one user from a shared KV cache, you have a fair-share problem identical to the one Linux's CFS, Borg, and Kubernetes have all solved in their own ways. A noisy-neighbor tenant with a 128K-token long-context job can evict every other tenant's working set, tanking aggregate hit rate and visibly hurting unrelated requests.
The mature engines address this with quota systems:
- Per-tenant block budgets ensure no single user can occupy more than N% of the cache.
- Priority classes let premium tenants resist eviction while free-tier tenants get evicted first.
- Cache partitioning carves the physical block pool into per-tenant slices, sacrificing some global sharing for guaranteed isolation.
The right framing is, again, database-shaped: this is the buffer-pool partitioning problem solved by RDBMSs since the 1980s. The lessons transfer. Hard partitioning gives predictable performance but lowers aggregate utilization. Soft quotas with priority eviction give better utilization but require eviction policies that actually respect priority. Most teams under-invest here until their first noisy-neighbor incident.
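A soft-quota sketch, with invented names: allocation beyond budget is allowed only out of unclaimed slack, so a noisy neighbor can borrow idle blocks but never another tenant's reserve:

```python
# Soft quotas over a shared block pool: budgets are reserves, not walls.

class QuotaPool:
    def __init__(self, total_blocks, budgets):
        self.free = total_blocks
        self.budgets = budgets                   # tenant -> block budget
        self.used = {t: 0 for t in budgets}

    def slack(self):
        # Free blocks not claimed by any tenant's remaining budget.
        claimed = sum(max(0, self.budgets[t] - self.used[t])
                      for t in self.budgets)
        return max(0, self.free - claimed)

    def try_alloc(self, tenant):
        if self.free == 0:
            return False
        over_budget = self.used[tenant] >= self.budgets[tenant]
        if over_budget and self.slack() == 0:
            return False        # would eat into another tenant's reserve
        self.free -= 1
        self.used[tenant] += 1
        return True

pool = QuotaPool(100, {"premium": 60, "free_tier": 20})
# free_tier may borrow the 20 unclaimed blocks, but never premium's 60.
```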
Disaggregated KV: Mooncake, Dynamo, and the Network-Bound Future
The frontier in 2026 is disaggregated KV. Until recently, the KV cache lived inside the inference replica that produced it; if the replica scaled down, the cache went with it. Two systems have meaningfully changed this.
Mooncake (Moonshot AI, 2024) introduced a KVCache-centric architecture for serving Kimi-style long-context models. Mooncake separates prefill nodes from decode nodes, with both reading and writing to a shared cluster-wide KV pool over the data-center network. Prefix matches are looked up against the entire cluster, not just the local replica, so cache hit rate scales with total cluster cache rather than per-replica cache. Reported production hit rates from Moonshot on Kimi exceed 80% on long-context workloads precisely because of this.
NVIDIA Dynamo (GTC 2025, GA 2026) generalizes the same idea: a KV router, a KV manager, and a KV transfer fabric (NIXL) sit between prefill and decode replicas. Crucially, Dynamo's KV transfer is hardware-aware: when a request's KV must move between replicas, it moves over NVLink-Switch or InfiniBand at hundreds of GB/s, not over Ethernet. This is the difference between "remote KV" being a 10-100× slower tier and a 5-10× slower tier — enough to make remote hits often faster than local recomputation.
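The router's job reduces to a scoring problem. A crude sketch, with an invented scoring weight; this is not Dynamo's actual policy, just the shape of the tradeoff:

```python
# Cache-aware routing, crudely: prefer the replica already holding the
# longest cached prefix, penalized by its queue depth.

from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_depth: int
    cached_prefix_tokens: dict = field(default_factory=dict)

def route(fingerprint, replicas, tokens_per_queued_req=2000):
    def score(r):
        return (r.cached_prefix_tokens.get(fingerprint, 0)
                - tokens_per_queued_req * r.queue_depth)
    return max(replicas, key=score)

a = Replica("a", queue_depth=3, cached_prefix_tokens={"req1": 8192})
b = Replica("b", queue_depth=0)
print(route("req1", [a, b]).name)  # "a": 8192 - 6000 beats an empty cache
```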
The architectural consequence is real: in 2026, a well-designed inference cluster is no longer a set of independent replicas. It is a tiered storage system with compute attached. The KV pool is the database; the model replicas are the workers. The frontier engines treat it that way explicitly.
KV Cache Quantization: Compression Without Tears
If the KV cache is a database, then quantization is compression. The math is simple: store KV in 8-bit and you halve the memory; store it in 4-bit and you quarter it. Concurrency doubles or quadruples without touching the model weights.
The catch is that, unlike weight quantization, KV quantization is runtime quantization — every prefill writes to it, every decode reads from it, and any per-token outlier handling has to be cheap. The research has converged on three families:
- KIVI (Liu et al., 2024) — per-channel quantization for keys, per-token quantization for values. Hits 2-bit with minimal quality loss on most benchmarks.
- KVQuant (Hooper et al., 2024) — non-uniform quantization with per-channel sensitivity. Best quality at 3-bit, slightly more complex kernels.
- GEAR (Kang et al., 2024) — combines quantization with a low-rank correction. Used for very long contexts where simple per-channel quantization breaks down.
In production, FP8 KV caching (supported on Hopper and Blackwell with native fp8 attention kernels) is now the standard recommendation: it doubles concurrency vs FP16, has effectively zero quality cost, and ships as a one-flag enable in vLLM, SGLang, and TensorRT-LLM. INT4 KV is becoming standard for memory-constrained edge deployments. The math is the same as any compression: the question is whether the savings (2-4× concurrency) exceed the cost (slightly increased kernel complexity and a small quality risk worth measuring). For most workloads the answer is yes.
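The scale-and-round structure behind all of these is easy to show. Here is per-token symmetric INT8 in NumPy; real FP8 KV caching runs inside fused attention kernels on Hopper/Blackwell, so this only illustrates the arithmetic and the memory math:

```python
# Per-token symmetric INT8 quantization of a KV slab: one scale per
# token, round to int8, dequantize on read. Shapes are illustrative.

import numpy as np

def quantize_per_token(kv):
    """kv: [tokens, heads * head_dim] in fp16. One scale per token."""
    scales = np.abs(kv).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)            # avoid divide-by-zero
    q = np.clip(np.round(kv / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize(q, scales):
    return q.astype(np.float16) * scales

kv = np.random.randn(32, 1024).astype(np.float16)
q, s = quantize_per_token(kv)
print(q.nbytes / kv.nbytes)                 # 0.5: half the bytes of fp16
print(np.abs(dequantize(q, s) - kv).max())  # small reconstruction error
```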
The Economics, Quantified
Three operational levers — paged attention, prefix caching, and KV quantization — compound multiplicatively. A worked example on a 70B-class model with mixed chat traffic, 8K average context, system prompts shared across 60% of requests:
| Optimization | Concurrency × | TTFT improvement | Prefill cost reduction |
|---|---|---|---|
| Baseline (contiguous KV, no caching) | 1.0× | — | — |
| + Paged attention | 2.7× | — | — |
| + Prefix caching (50% hit rate) | 2.7× | 2.0× faster | 50% |
| + Prefix caching (75% hit rate) | 2.7× | 4.0× faster | 75% |
| + FP8 KV quantization | 5.4× | unchanged | unchanged |
A team that engineers all three of these — and almost any team running on modern vLLM or SGLang has all three available — can serve 5-10× the traffic per GPU compared to a baseline that treats KV cache as transient memory. The cost per token follows directly. For a workload spending $30K/month on inference, this is the difference between $30K and $4K. That gap is what makes the KV cache the most under-tuned component in modern production AI.
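A crude model of that compounding, assuming prefill is half of baseline GPU cost and that the levers are independent; both are assumptions for illustration, not measurements:

```python
# Crude compounding model for the table above.

baseline = 30_000                 # $/month at 1.0x
capacity_gain = 2.7 * 2.0         # paged attention x FP8 KV = 5.4x
hit_rate = 0.75                   # prefix cache hit rate
prefill_share = 0.5               # assumed fraction of GPU time in prefill

cost = baseline / capacity_gain
cost *= (1 - prefill_share) + prefill_share * (1 - hit_rate)
print(round(cost))                # ~3472: the ballpark of the $4K claim
```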
Persistent KV and the Coming Agent-Memory Reckoning
The deepest implication of treating KV cache as a database is that, like every other database, it could be persistent.
Today, the KV cache typically lives only as long as the request that produced it (or, with prefix caching, a few minutes longer). But a long-running agent — a coding assistant working on a multi-day task, a customer-support agent that has spoken to one user for six months, an autonomous research agent investigating a corpus over weeks — has a working set that is logically the same KV cache, just across sessions and across replicas.
The early production patterns:
- Anthropic's 1-hour cache tier (and the equivalent OpenAI prompt-caching offerings) represents the first commercial step toward persistent KV. Bill the user once for the prefill, then let them amortize that cost across many follow-ups within a window.
- Per-session NVMe-backed KV ships in vLLM-Pro and DeepSpeed-Inference's KV-Offload module. Persist the request's working set to local NVMe between turns; reload on the next turn. Sub-millisecond resume.
- Cluster-wide KV stores (Mooncake's architecture) make the KV pool durable across replica restarts.
The destination this is all driving toward is plain: the KV cache becomes a persistent, queryable, multi-tenant memory layer — a real database in the operational sense, not just the analogy. The applications that take it seriously will have unit economics that look entirely different from those that recompute the world on every turn.
What This Means for Your Stack
Three questions to ask of your inference deployment, today:
1. What is your prefix cache hit rate, per logical prefix segment? If you do not have this metric on a dashboard, you cannot tune the largest cost lever in your stack. Most engines expose it via Prometheus.
2. What is your KV-cache HBM utilization? Should be above 90% under load. If it is below 70%, your paged-attention configuration is fragmenting and concurrency is unnecessarily capped.
3. Are you running FP8 KV quantization? On Hopper or Blackwell, you should be. The concurrency doubling is essentially free.
The framing matters. Inference engineering teams that treat the KV cache as a buffer optimize first-token latency in isolation, never look at hit-rate curves, and over-provision GPUs. Teams that treat it as a database build dashboards, set hit-rate SLOs, partition multi-tenant traffic, and run on a fraction of the hardware.
The deepest lesson from forty years of database engineering is that storage hierarchies, once they exist, reward you for paying attention to them and punish you for ignoring them. The KV cache, whether your team has noticed or not, is now a storage hierarchy. The vocabulary of buffer pools, working sets, eviction policies, and content-addressable stores is not metaphor — it is the literal model of what your inference engine is doing under the hood.
Open your inference metrics endpoint. Find prefix_cache_hit_rate. That number is the most important number in your inference stack you have probably never looked at.
References
- Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., & Stoica, I. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. arXiv:2309.06180
- Zheng, L., Yin, L., Xie, Z., Sun, J., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., & Sheng, Y. (2024). SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS 2024. arXiv:2312.07104
- Qin, R., et al. (2024). Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079
- Liu, Z., et al. (2024). KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. ICML 2024. arXiv:2402.02750
- Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., & Gholami, A. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization. NeurIPS 2024. arXiv:2401.18079
- Kang, H., et al. (2024). GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM. arXiv:2403.05527
- NVIDIA. Dynamo Distributed Inference Framework. github.com/ai-dynamo/dynamo
- vLLM project. Automatic Prefix Caching documentation. docs.vllm.ai
- Anthropic. Prompt Caching with Claude. docs.anthropic.com