Speculative Decoding in Production: How a 1B Draft Model Cuts 70B Latency by 3-5×
The largest single inference speedup of the last three years is also nearly invisible to application developers. A small draft model proposes tokens; the big model verifies them in parallel; a rejection-sampling rule guarantees the output distribution is unchanged. Here is how it actually works, and why your stack probably has it enabled already.
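
To make the dek concrete, here is a minimal sketch of the verification round at the heart of speculative decoding. The function and variable names (`speculative_step`, `p_target`, `q_draft`) are hypothetical, and the toy Dirichlet distributions stand in for real model outputs; the accept/reject rule itself is the standard one from the speculative sampling literature (Leviathan et al., 2023).

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft, drafted, rng):
    """One verification round of speculative sampling (a sketch).

    p_target: (K+1, V) target-model probabilities at each position
    q_draft:  (K, V)   draft-model probabilities at each position
    drafted:  (K,)     token ids proposed by the draft model

    Accept drafted token x with probability min(1, p(x)/q(x)); on the
    first rejection, resample from the residual max(0, p - q),
    renormalized. This keeps the output distribution identical to
    sampling from the target model alone.
    """
    accepted = []
    for k, x in enumerate(drafted):
        p_x, q_x = p_target[k, x], q_draft[k, x]
        if rng.random() < min(1.0, p_x / q_x):
            accepted.append(int(x))      # token survives verification
        else:
            residual = np.maximum(p_target[k] - q_draft[k], 0.0)
            residual /= residual.sum()   # renormalize the leftover mass
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted              # stop at the first rejection
    # All K drafts accepted: take one bonus token from the target model.
    accepted.append(int(rng.choice(p_target.shape[1], p=p_target[-1])))
    return accepted

# Toy usage: random target/draft distributions over a 50-token vocab.
V, K = 50, 4
q = rng.dirichlet(np.ones(V), size=K)
p = rng.dirichlet(np.ones(V), size=K + 1)
drafted = np.array([rng.choice(V, p=q[k]) for k in range(K)])
print(speculative_step(p, q, drafted, rng))
```

The speedup comes from the return shape: each round costs one target-model forward pass but can emit up to K+1 tokens, so latency drops roughly in proportion to how often the draft's guesses are accepted.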
