Speculative Decoding in Production: How a 1B Draft Model Cuts 70B Latency by 3-5×
The largest single inference speedup of the last three years is also the most invisible to application developers. A small draft model proposes tokens; a big model verifies them in parallel; the math guarantees the output distribution is unchanged. Here is how it actually works — and why your stack probably has it on already.
In the systems-engineering literature on large language models, very few ideas qualify as a true free lunch. Speculative decoding is one of them.
The technique is now standard in every serious inference engine — vLLM, SGLang, TensorRT-LLM, LMDeploy, llama.cpp — yet most application developers have never read the rejection-sampling proof that makes it work, never tuned a draft model, and never inspected the acceptance rate of their own deployment. They just observe that tokens-per-second went up. The default in vLLM 0.7+ is already to attempt speculative execution where it makes sense; SGLang ships EAGLE-3 as a first-class draft method; NVIDIA's Dynamo platform builds disaggregated speculative serving directly into the scheduler.
This article is the missing piece between "I read the original Leviathan paper" and "I tuned this for a 200-replica production cluster." We will derive why the algorithm preserves the target distribution exactly, walk through the four families of draft methods that matter in 2026 (vanilla, Medusa, EAGLE-3, lookahead/n-gram), and then look at the production knobs — speculative length, acceptance threshold, batch interaction, KV cache layout — that decide whether you actually see a 3× speedup or quietly pay a 1.2× tax for a feature you thought was helping.
If you serve LLMs and you cannot tell me your acceptance rate, your average accepted length, or your effective tokens-per-second per replica with and without speculation, you are leaving the largest performance lever in the modern inference stack untuned.
Why Decoding Is Slow in the First Place
Autoregressive decoding has a deeply unflattering hardware profile. Every output token requires a full forward pass through the model. For a 70B-parameter model in FP8, that is roughly 70 GB of weights that must be streamed from HBM into the SM register files, just to produce a single token. On an H200 with ~4.8 TB/s of memory bandwidth, the absolute floor on per-token latency is around 14 ms — and that ignores attention over the KV cache, which grows linearly with context.
The arithmetic intensity of decoding — operations per byte of memory read — is essentially zero. A single matmul of shape (1, d_model) × (d_model, d_model) performs 2·d_model² flops while reading d_model² bytes of weights (one byte per parameter in FP8); the ratio is 2 ops/byte, while modern GPUs want 300+ ops/byte to be compute-bound. This is why the FLOPS counter on your nvtop reads 5% during decode and the memory-bandwidth counter reads 95%. Decoding is not compute-bound. It is bandwidth-bound.
This single observation drives the entire field. If you could somehow process N tokens through the model in the same forward pass — paying once for the weight reads — you would amortize that read over N tokens and get a near-N× speedup, until you finally saturated the compute units. Modern GPUs can typically swallow 4-8 tokens per "decode step" before compute begins to bind. That headroom is what speculative decoding spends.
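To see that amortization concretely, here is a back-of-envelope sketch using the figures above (70 GB of FP8 weights, ~4.8 TB/s of HBM bandwidth). It ignores attention, KV-cache reads, and kernel overhead, so treat it as a latency floor rather than a prediction:

```python
WEIGHT_BYTES = 70e9   # 70B parameters at 1 byte/param (FP8)
HBM_BW = 4.8e12       # ~4.8 TB/s on an H200

step_ms = WEIGHT_BYTES / HBM_BW * 1e3   # cost of streaming the weights once

for n in (1, 2, 4, 8):
    # The weight read is paid once per decode step, so n tokens riding in the
    # same step amortize it n ways -- valid until compute, not bandwidth, binds.
    print(f"{n} tokens/step -> {step_ms / n:.2f} ms per token")
```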
The single hardware fact that justifies the entire technique
A decode-step matrix multiplication on an H200 reaches roughly 5–10% of peak compute. The remaining 90–95% is idle compute waiting for memory. Speculative decoding turns that idle compute into useful work by verifying multiple draft tokens in one forward pass — and gets it almost for free, because the marginal cost of pushing a few extra tokens through a single decode step is much smaller than the cost of another full decode step.
The Core Algorithm
Speculative decoding, introduced concurrently by Leviathan et al. (Google, 2023) and Chen et al. (DeepMind, 2023), works as follows. Let M_q be a small draft model (e.g., a 1B distilled version of the target) and M_p be the target model (e.g., 70B). The draft samples cheaply; the target is what we ultimately want to sample from.
For each decoding round:
- The draft model M_q autoregressively generates γ candidate tokens x_1, x_2, …, x_γ from the current context. This is fast because M_q is small.
- The target model M_p performs a single forward pass over the prefix plus all γ draft tokens, producing logits — and therefore a probability distribution p(·|prefix, x_<i) — at every position. Crucially, this is one forward pass, not γ. The same weights are read once.
- We then walk the draft tokens left to right and accept each one with probability:

  α_i = min(1, p_i(x_i) / q_i(x_i))

  where p_i is the target's distribution at position i and q_i is the draft's. If a token is rejected, we resample it from the adjusted distribution p'_i ∝ max(0, p_i − q_i) (normalized), and stop accepting further drafts.
- If all γ draft tokens are accepted, we additionally sample one bonus token from p_{γ+1}, since the target's forward pass already produced its distribution. So in the lucky case we emit γ + 1 tokens for one target forward pass.
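In code, one round of this accept/reject loop looks roughly like the sketch below. The draft_model and target_model callables are toy stand-ins that return a next-token distribution given a token list (not any engine's real API), and the batched target verification is faked with per-position calls to keep the sketch short:

```python
import numpy as np

def speculative_round(prefix, draft_model, target_model, gamma, rng):
    """One round of the Leviathan/Chen speculative decoding scheme (sketch)."""
    # 1. Draft gamma candidate tokens autoregressively with the small model.
    ctx = list(prefix)
    drafts, q_dists = [], []
    for _ in range(gamma):
        q = draft_model(ctx)
        x = rng.choice(len(q), p=q)
        drafts.append(x); q_dists.append(q); ctx.append(x)

    # 2. The target's distributions at every drafted position. A real engine
    #    gets all of these from ONE forward pass over prefix + drafts.
    p_dists = [target_model(list(prefix) + drafts[:i]) for i in range(gamma + 1)]

    # 3. Walk the drafts left to right, accepting with probability min(1, p/q).
    out = []
    for x, q, p in zip(drafts, q_dists, p_dists):
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
        else:
            residual = np.maximum(p - q, 0.0)      # adjusted distribution
            residual /= residual.sum()
            out.append(rng.choice(len(residual), p=residual))
            return out                              # stop at the first rejection

    # 4. All gamma accepted: bonus token from the target's final distribution.
    out.append(rng.choice(len(p_dists[-1]), p=p_dists[-1]))
    return out
```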
That is the entire algorithm. The miracle is the next part.
Why the Output Distribution Is Preserved Exactly
This is the result that makes speculative decoding production-safe. Most inference optimizations — quantization, distillation, pruning, flash attention with reduced-precision softmax — change the output distribution. Speculative decoding does not. Its output is mathematically indistinguishable from sampling directly from the target model.
The proof is short. Consider any token x and the probability that the algorithm emits it. Two disjoint cases:
Case 1: x was the draft token and it was accepted. This happens with probability q(x) · min(1, p(x)/q(x)) = min(q(x), p(x)).
Case 2: the draft token (whichever token it was) got rejected, and x was then sampled from the adjusted distribution p' ∝ max(0, p − q). The probability of any rejection at all is 1 − Σ_y min(p(y), q(y)), and conditional on rejection we draw x with probability p'(x) = max(0, p(x) − q(x)) / (1 − Σ_y min(p(y), q(y))). Multiplying, the normalizer cancels and we are left with max(0, p(x) − q(x)).
Total probability of emitting x:
min(q(x), p(x)) + max(0, p(x) − q(x))
= min(q(x), p(x)) + (p(x) − min(p(x), q(x)))
= p(x).
We recover p(x) exactly. This is just rejection sampling done right: the draft q is the proposal, and the acceptance/resampling step adjusts for the proposal/target mismatch. The result is unconditional. It does not depend on how good the draft model is. A bad draft will produce many rejections — slow, but still correct — while a good draft produces many accepts, and you win speed.
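The identity is easy to check numerically. The sketch below draws an arbitrary target p and a deliberately mismatched draft q, runs the accept/resample rule a couple hundred thousand times, and compares the empirical output frequencies against p:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5
p = rng.dirichlet(np.ones(V))   # arbitrary target distribution
q = rng.dirichlet(np.ones(V))   # arbitrary, mismatched draft distribution

def emit_one(rng):
    x = rng.choice(V, p=q)                      # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)           # resample from max(0, p - q)
    return rng.choice(V, p=residual / residual.sum())

counts = np.bincount([emit_one(rng) for _ in range(200_000)], minlength=V)
print(np.round(counts / counts.sum(), 3))   # empirical emission frequencies
print(np.round(p, 3))                       # agree with p up to Monte Carlo noise
```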
This is the property that makes speculative decoding usable in production for anything from chat completions to code generation to function-call tool-use: you cannot tell from the output whether it was on or off. Only the wall-clock latency reveals it.
The Speedup, Quantified
If α is the average per-token acceptance probability and γ is the speculation length, the expected number of tokens emitted per round follows a geometric series truncated at γ:
E[tokens per round] = (1 − α^(γ+1)) / (1 − α)
The wall-clock cost per round is one target forward pass on γ + 1 positions plus γ cheap draft passes. If the draft costs a fraction c = T_q / T_p of the target, then the speedup over standard decoding is:
Speedup = (1 − α^(γ+1)) / [(1 − α) · (1 + c·γ)]
Plug in realistic numbers. For a Llama-3.3-70B target with a Llama-3.2-1B draft, c ≈ 0.02 — the draft is roughly 50× cheaper. With α = 0.7 (typical for chat traffic on a well-distilled draft) and γ = 5, the formula gives a 2.7× speedup. With α = 0.85 (typical for code and structured output, where the draft predicts boilerplate and brackets reliably) and γ = 7, you get about 4.3×. The literature's published 3-5× figures are not marketing — they fall directly out of this geometry.
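The formula is worth keeping around as a two-line calculator when you tune γ. The sketch below reproduces the figures above and shows how quickly the win evaporates once α drops:

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    # Truncated geometric series: expected tokens emitted per round.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, c: float) -> float:
    # Cost per round: one target pass plus gamma draft passes at fraction c each.
    return expected_tokens(alpha, gamma) / (1 + c * gamma)

print(round(speedup(0.70, 5, 0.02), 2))   # ~2.67x  (chat traffic)
print(round(speedup(0.85, 7, 0.02), 2))   # ~4.25x  (code / structured output)
print(round(speedup(0.40, 5, 0.02), 2))   # ~1.51x  (poorly matched draft)
```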
The Four Families of Draft Methods
The original Leviathan/Chen formulation uses a separate, smaller model as the draft. That works, but it has practical downsides — you have to host two models, the draft has its own KV cache, and finding a draft well-distilled enough to hit high acceptance requires a real training pipeline. The last two years of research have produced three additional draft families that sidestep these costs.
1. Two-Model Speculative Decoding (Vanilla)
The original formulation. A smaller model from the same family acts as the draft. Strengths: simple to integrate, works with any sampler, well-understood. Weaknesses: needs a second model loaded in memory, two KV caches, draft latency is non-trivial unless the draft is very small.
In production, the canonical pairings are: Llama-3.3-70B with Llama-3.2-1B; DeepSeek-V3 with DeepSeek-V3-Lite (8B); Qwen2.5-72B with Qwen2.5-1.5B. The draft must share the tokenizer and ideally be distilled from the target so its distribution q is close to p — close drafts have higher α, and the speedup is highly sensitive to α once it drops below ~0.5.
2. Medusa: Parallel Multi-Head Drafting
Cai et al. (2024) made a deceptively simple observation: instead of running an entire small model γ times, attach γ extra prediction heads on top of the target itself. Each head is trained to predict the token at position t + k, in parallel, from the same hidden state. There is no draft model — the "draft" is just γ cheap linear heads sharing the target's representation.
Medusa's strength is operational simplicity: one model, one KV cache, no draft pipeline. Its weakness is that each head is conditionally independent given the same hidden state, so the joint draft sequence is less coherent than what a sequential draft model produces. Acceptance rates are typically 10-20 points lower than a well-distilled two-model setup. Medusa-2 (2024) and the follow-up tree-decoding extensions ameliorate this by drafting multiple candidate sequences and verifying them with a single target forward pass over a tree — see SpecInfer (Miao et al., 2024) for the canonical tree formulation.
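For intuition, here is a minimal sketch of what Medusa-style heads amount to architecturally (hypothetical code, not the reference implementation): each head is a small MLP over the target's final hidden state, and head k drafts the token k + 1 positions ahead, all in parallel:

```python
import torch
import torch.nn as nn

class MedusaHeads(nn.Module):
    """Sketch of Medusa-style drafting: gamma extra heads, each predicting the
    token k steps ahead from the SAME final hidden state of the target."""
    def __init__(self, d_model: int, vocab_size: int, gamma: int = 4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.SiLU(),
                          nn.Linear(d_model, vocab_size))
            for _ in range(gamma)
        ])

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, d_model), the target's hidden state at position t.
        # Head k produces logits for position t + k + 1; because every head sees
        # only this one state, the drafted tokens are conditionally independent.
        return torch.stack([head(last_hidden) for head in self.heads], dim=1)
```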
3. EAGLE-3: Feature-Level Drafting
EAGLE (Li et al., 2024) and EAGLE-3 (2025) take the most aggressive approach: instead of drafting tokens from text, draft in feature space. The draft network is a single transformer layer that consumes the target model's penultimate-layer features (not just embeddings) and predicts the next feature, which is then projected through the target's LM head to produce a token distribution.
This works because the target's hidden states are vastly more informative than its tokens — they contain everything about the model's belief state right before the LM head collapses it to a discrete distribution. Drafting in feature space gives the draft network access to that uncollapsed information, which translates into much higher per-step acceptance rates. EAGLE-3 reports 3.5-5× wall-clock speedups across Llama and Qwen targets, with α regularly above 0.85 even at γ = 8.
EAGLE-3 is now the default speculation backend in SGLang and is supported by vLLM 0.8+ and TensorRT-LLM. Its main cost is training: the draft layer requires its own training run on outputs from the target, typically a few hundred million tokens. For closed-weight models you cannot train your own — you are limited to whatever the model provider ships.
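To make the feature-level idea concrete, here is an illustrative sketch of an EAGLE-style draft layer (the shape of the idea, not the released architecture): fuse the target's last hidden feature with the embedding of the token just sampled, run one transformer layer, and decode the predicted next feature through the target's frozen LM head.

```python
import torch
import torch.nn as nn

class FeatureLevelDraft(nn.Module):
    """Illustrative EAGLE-style draft: predict the target's next hidden feature,
    then reuse the target's own (frozen) LM head to turn it into a token
    distribution. A real implementation adds causal masking and a KV cache."""
    def __init__(self, d_model: int, target_lm_head: nn.Linear, n_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)   # [feature ; token embedding]
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.lm_head = target_lm_head                 # shared with the target, frozen
        for p in self.lm_head.parameters():
            p.requires_grad_(False)

    def forward(self, feats: torch.Tensor, tok_embs: torch.Tensor):
        # feats, tok_embs: (batch, seq, d_model)
        h = self.fuse(torch.cat([feats, tok_embs], dim=-1))
        next_feat = self.layer(h)                     # drafted next feature
        return self.lm_head(next_feat), next_feat     # logits + feature to re-feed
```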
4. Lookahead Decoding and N-Gram Drafting
Fu et al. (2024) introduced lookahead decoding, which uses no auxiliary model at all. Instead, the target model itself runs a Jacobi-style parallel iteration over a 2D grid: at each step it produces guesses for the next γ positions, then in subsequent steps refines those guesses while extending the frontier. Verified n-grams are pulled out of the grid as they stabilize.
For coding workloads in particular, an even simpler trick often beats sophisticated drafts: prompt lookup decoding (PLD), now a standard option in vLLM and other engines, simply searches the existing context for a matching n-gram and uses the next few tokens as the speculative draft. This is shockingly effective for code (where the same identifier appears repeatedly), document QA (where the model frequently quotes its retrieved context), and JSON generation (where structural tokens recur). No model is needed — it is essentially a grep over the prompt.
PLD is the right starting point if you have not yet measured your workload. It costs nothing to enable, requires no training, and on retrieval-augmented and code-completion traffic frequently delivers 1.5-2× speedups for free. If your workload has high lexical recurrence, PLD may be all you need. If it does not, you graduate to EAGLE.
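PLD is simple enough to fit in a dozen lines. The sketch below is the idea rather than any engine's implementation: match the last n-gram of the context against earlier occurrences and propose whatever followed the most recent match:

```python
def prompt_lookup_draft(context: list[int], ngram: int = 3, max_draft: int = 8) -> list[int]:
    """Propose draft tokens by finding the last `ngram` tokens earlier in the
    context and copying what followed that occurrence (sketch, not an engine API)."""
    if len(context) <= ngram:
        return []
    pattern = context[-ngram:]
    # Scan right to left, skipping the suffix's trivial match with itself.
    for start in range(len(context) - ngram - 1, -1, -1):
        if context[start:start + ngram] == pattern:
            return context[start + ngram : start + ngram + max_draft]
    return []   # no recurrence: no draft this round, fall back to plain decoding
```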
What Drives Acceptance Rate
The single number that determines whether speculation is worth doing is the per-token acceptance rate α. Everything else is secondary. A few factors shape it strongly.
Distribution similarity between draft and target. This is what distillation buys you. A draft trained on the target's outputs (rather than just the original training data) directly minimizes the KL divergence that controls α. Off-the-shelf "small models from the same family" are usually distilled by the model provider, but the quality varies. Llama-3.2-1B is well-distilled from the 70B; some open-source community draft models are not.
Sampling temperature. At T = 0 (greedy), α reduces to whether the draft's argmax matches the target's argmax. At higher temperatures, both distributions spread out and α becomes a softer mass-overlap quantity — usually higher than the greedy case. Measure α at your actual production temperature; do not benchmark at T = 0 and assume the number generalizes.
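The temperature effect is easy to see numerically. For a drafted token, the expected acceptance equals the mass overlap Σ_x min(p(x), q(x)); the toy example below (made-up logits, with a noisy perturbation standing in for the draft) evaluates that overlap at a few temperatures:

```python
import numpy as np

def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
target_logits = rng.normal(size=1000)
draft_logits = target_logits + rng.normal(scale=0.5, size=1000)  # imperfect draft

for T in (0.1, 0.6, 1.0):
    p, q = softmax(target_logits, T), softmax(draft_logits, T)
    # Drafting from q and verifying against p accepts with expected probability
    # sum_x min(p(x), q(x)); it typically grows as T rises, because both
    # distributions flatten and overlap more.
    print(f"T={T}: expected acceptance = {np.minimum(p, q).sum():.2f}")
```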
Domain shift. A draft trained mostly on web text will accept poorly on legal contract drafting, mathematical proofs, or non-English languages it was undertrained on. If your traffic is highly specialized, an in-domain fine-tuned draft is worth the engineering cost. EAGLE drafts in particular benefit from domain fine-tuning more than full draft models do, because the EAGLE layer is small and adapts quickly.
Batch composition. This is the production gotcha most teams miss: speculation interacts non-trivially with continuous batching. A request being verified takes a forward pass over γ + 1 positions; a request not being verified takes a single position. Mixing them in the same batch wastes work on the small request unless the scheduler is speculation-aware. SGLang and vLLM 0.8+ handle this; older versions may show worse throughput when speculation is on and batches are large.
The high-throughput regime is where speculation can lose
At low concurrency (1-4 in-flight requests), the GPU is bandwidth-bound and speculation is pure win. At high concurrency (32+), the GPU is already compute-saturated by continuous batching — the "free" verification capacity that speculation exploits is gone, and the extra draft tokens become pure overhead. Most modern engines automatically disable speculation per-request when the batch is hot. If yours doesn't, you are paying a 5-15% throughput tax under load. Measure both regimes before declaring victory.
A Concrete Production Setup
Let us make this real. Suppose you are running Llama-3.3-70B in FP8 on 4× H200 with TP=4 for chat traffic, behind vLLM. Here is what a production-ready speculative configuration looks like, and what to measure.
```python
from vllm import LLM, SamplingParams
# Two-model speculative decoding — Llama-3.3-70B target,
# Llama-3.2-1B as the distilled draft.
llm = LLM(
model="meta-llama/Llama-3.3-70B-Instruct",
tensor_parallel_size=4,
quantization="fp8",
speculative_config={
"model": "meta-llama/Llama-3.2-1B-Instruct",
"num_speculative_tokens": 5,
# vLLM 0.8+ — EAGLE-3 is also supported by setting:
# "method": "eagle3",
# "model": "yuhuili/EAGLE3-LLaMA3.3-Instruct-70B",
},
# Disable speculation when the running batch is too hot —
# this is the knob that prevents the high-throughput regression.
speculative_disable_by_batch_size=8,
)
params = SamplingParams(
temperature=0.6,
top_p=0.9,
max_tokens=512,
)
outputs = llm.generate(
["Explain how rejection sampling preserves the target distribution."],
params,
)
```

You should benchmark four things and watch them drift over time:

- End-to-end tokens-per-second per replica, with and without speculative_config. This is the headline number your CFO cares about. Expect 2-4× on chat, 3-5× on code, 1.5-2× on retrieval QA with PLD alone.
- Acceptance rate α, exposed by vLLM's metrics endpoint as vllm:spec_decode_efficiency. Below 0.5, your draft is poorly matched and you should consider distillation or switching to EAGLE. Above 0.8, you can probably increase num_speculative_tokens.
- Average accepted length per round. This should be in the 2.5-5 range for γ = 5. If it sits at 1.0, your draft is rejecting almost everything and you are running speculation as net overhead.
- TTFT (time to first token) impact. Speculation primarily helps inter-token latency, not TTFT. If your latency SLO is dominated by TTFT, the user-visible improvement will be smaller than the tokens-per-second ratio suggests.
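If you want to eyeball these numbers without standing up a dashboard, the snippet below scrapes a running server's Prometheus endpoint and prints anything speculation-related. The URL and exact metric names are assumptions (they differ across vLLM versions), so it greps loosely rather than hard-coding one name:

```python
import requests

METRICS_URL = "http://localhost:8000/metrics"   # assumed default vLLM port

def spec_decode_metrics(url: str = METRICS_URL) -> dict[str, float]:
    """Return every Prometheus sample whose name mentions speculation."""
    out = {}
    for line in requests.get(url, timeout=5).text.splitlines():
        if line.startswith("#"):
            continue
        if "spec_decode" in line or "draft" in line:
            name, _, value = line.rpartition(" ")
            try:
                out[name] = float(value)
            except ValueError:
                pass
    return out

for name, value in sorted(spec_decode_metrics().items()):
    print(f"{name}: {value}")
```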
When Speculative Decoding Goes Wrong
A non-exhaustive list of the failure modes we have seen in production:
Tokenizer mismatch. The draft and target must share a tokenizer exactly. A subtle mismatch — different BPE merges, an added special token — silently produces a q distribution that is incompatible with p, and acceptance collapses. This is the first thing to check when α looks unreasonably low.
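A quick preflight check along these lines catches most tokenizer mismatches before they show up as a collapsed acceptance rate (the model names are the pairing used in the configuration above):

```python
from transformers import AutoTokenizer

target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
draft_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

# Same vocabulary and same special tokens are necessary but not sufficient...
assert target_tok.get_vocab() == draft_tok.get_vocab(), "vocabularies differ"
assert target_tok.all_special_tokens == draft_tok.all_special_tokens, "special tokens differ"

# ...so also spot-check real encodings, which catches differing merges or normalization.
probes = ["def tokenize(s):", "¿Qué hora es?", '{"tool_call": {"name": "search"}}']
for s in probes:
    assert target_tok.encode(s) == draft_tok.encode(s), f"encoding mismatch on {s!r}"
print("tokenizers match on vocab, special tokens, and probe strings")
```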
Wrong sampler. Min-p sampling, repetition penalties, logit biases, structured-output constraints (JSON mode, grammar-constrained decoding) all transform p into a different distribution p'. The acceptance test must use p', not p, or the rejection-sampling guarantee breaks. Mature engines handle this correctly; ad-hoc speculative-decoding implementations often do not. If you are running grammar-constrained decoding, verify your engine applies the grammar inside the acceptance test.
Long contexts. With 128K-token contexts, the draft model's KV cache becomes non-trivial. A 1B draft at 128K context holds a few gigabytes of KV cache per request at 16-bit precision — small next to the target's cache, but it competes with it for HBM. EAGLE wins here because its draft layer is tiny and shares the target's KV.
Speculative length too long. Setting γ = 16 "to be safe" usually loses, because α^γ decays geometrically and the verification pass becomes expensive on rejects. The sweet spot for most workloads is γ ∈ [3, 7]. Tune empirically.
Cold start. The first decode round has no useful PLD context and EAGLE's hidden states are flat. Speculation typically does not pay off until 30-50 tokens into a generation. For very short outputs (classification, single-sentence completions), speculation can be a net loss. Engines like SGLang detect this and disable speculation per-request when the predicted output is short.
The Frontier: Disaggregated and Cross-Tier Speculation
Two research directions are actively reshaping the production landscape and worth watching.
Disaggregated speculation runs the draft and target on separate hardware tiers. NVIDIA's Dynamo platform (introduced at GTC 2025 and matured in 2026) places drafts on smaller GPUs and the target on H200/B200, communicating through high-bandwidth interconnect. This decouples scaling — you can scale up draft replicas during bursty workloads without renting more flagship GPUs. Early production deployments report 20-30% better throughput-per-dollar than monolithic speculation.
Cross-tier speculation across model families. The original assumption that draft and target must come from the same family is being relaxed. Recent work (e.g., universal speculative decoding using a small distilled "translator" between mismatched tokenizers) has shown viable acceptance rates with mixed-family pairs, opening the door to using a single high-quality draft (say, a 1B Llama distill) to accelerate many different target models in a multi-tenant gateway. We expect this to be the next mainstream production pattern, particularly for inference vendors hosting dozens of customer-fine-tuned variants of the same base model.
What This Means for Your Stack
If you operate an LLM inference deployment at any meaningful scale, three things follow from the analysis above:
- Speculative decoding is not optional in 2026. It is the difference between paying for 1× GPU capacity and 3-5× GPU capacity, and it preserves the output distribution exactly. Every production engine supports it. Turn it on, measure acceptance, tune γ. The effort is hours; the payoff is sustained.
- Your workload determines the right draft method. Code and document-QA traffic with high lexical recurrence: start with PLD, it costs nothing. General chat traffic with diverse outputs: two-model with a well-distilled draft, or EAGLE-3 if available. Specialized domain traffic: a domain-fine-tuned EAGLE draft. There is no single best choice.
- The numbers you should know cold. Your acceptance rate. Your average accepted length. Your effective tokens-per-second per replica. Your behavior under high concurrency. If any of these are unfamiliar, the largest performance lever in your stack is currently untuned, and a fix is a configuration change away.
The deepest reason speculative decoding has become ubiquitous is not its speedup. It is the rejection-sampling proof — the fact that the technique is mathematically a no-op on output quality. Most inference optimizations force a quality/latency trade. Speculative decoding refuses to. That is an unusually clean property in a field full of careful approximations, and it is why every serious serving system has converged on it.
If you take one thing from this article: open your inference engine's metrics endpoint right now and look for acceptance_rate. Whatever number you see is the most important number in your inference stack. Most teams have never looked at it.
References
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. ICML 2023. arXiv:2211.17192
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318
- Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., & Dao, T. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. ICML 2024. arXiv:2401.10774
- Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. ICML 2024. arXiv:2401.15077
- Fu, Y., Bailis, P., Stoica, I., & Zhang, H. (2024). Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. arXiv:2402.02057
- Miao, X., et al. (2024). SpecInfer: Accelerating Large Language Model Serving with Tree-Based Speculative Inference and Verification. ASPLOS 2024. arXiv:2305.09781
- vLLM project. Speculative Decoding documentation. docs.vllm.ai
- SGLang project. EAGLE-3 integration notes. github.com/sgl-project/sglang