Speculative Decoding

May 2026 – Vladislav Kruglikov

Speculative decoding accelerates text generation by putting idle compute to work. A small, fast model drafts several tokens ahead, and the large model verifies them in a single forward pass. When the draft is right, you get multiple tokens for roughly the cost of one target-model step, while preserving generation quality. It helps most when outputs are predictable, generations are long, and there are idle FLOPs available.

Intuition Behind Speculative Decoding
General Framework
Draft Systems
Verify Systems
What Works Today
Epilogue

Intuition Behind Speculative Decoding

The intuition starts with the difference between prefill and decode. Prefill is compute-bound. Many tokens are processed together, so loading the model weights is amortized over a lot of matrix multiplication. At small batch sizes, decode is often memory-bound. Each request usually adds only one token, so the GPU keeps reloading huge weights for a small amount of compute.

At low batch size, this leaves tensor cores underutilized. Verifying one token and verifying a few candidate tokens can have similar wall-clock cost, because the expensive part was loading the weights in the first place. Extra verified positions behave like extra batch elements.
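To make the "extra positions are nearly free" claim concrete, here is a minimal roofline-style sketch. Everything in it is an assumption for illustration: a hypothetical 8B dense model in BF16, roughly 989 TFLOPS of dense BF16 compute, and roughly 3.35 TB/s of HBM bandwidth. It ignores attention, KV-cache traffic, and kernel overheads, so treat it as a lower bound, not a benchmark.

```python
def forward_pass_time_s(n_params, bytes_per_param, positions, peak_flops, mem_bw):
    """Roofline-style lower bound for one forward pass over `positions` tokens."""
    t_memory = n_params * bytes_per_param / mem_bw      # stream every weight once
    t_compute = 2 * n_params * positions / peak_flops   # ~2 FLOPs per weight per token
    return max(t_memory, t_compute)


# Assumed numbers (not measurements): hypothetical 8B dense model, BF16 weights,
# ~989 TFLOPS dense BF16 and ~3.35 TB/s HBM bandwidth.
PARAMS, BYTES, FLOPS, BW = 8e9, 2, 989e12, 3.35e12

for positions in (1, 4, 16, 64, 512):
    ms = forward_pass_time_s(PARAMS, BYTES, positions, FLOPS, BW) * 1e3
    print(f"{positions:4d} positions -> {ms:.2f} ms")
```

Under these assumptions the estimate stays flat while the pass is bandwidth-limited and only starts growing once the position count is large enough for compute to dominate.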

The other reason this works is that many continuations are easy. «The capital of France is» is easy. So are many pieces of code, JSON, templates, and low-entropy continuations. Speculative decoding uses a cheap drafter to guess those easy tokens, then asks the target model to verify them in parallel. When the guesses survive, decode advances by several tokens instead of one.

Branch Prediction Intuition

For systems engineers, the closest analogy is branch prediction. The draft system is the cheap predictor. It guesses the next few tokens before the expensive model has fully committed to them. The verifier is the correctness check. If the prediction matches what the target model would have sampled, you retire several tokens at once. If not, you discard the wrong suffix and continue from the last accepted token. The important difference is that speculative decoding can do this losslessly, so the target distribution is preserved.
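The lossless part is worth seeing in code. The sketch below is a toy version of the standard speculative-sampling accept/reject rule for a single drafted token; the names `verify_one` and `emitted_frequencies` and the example distributions are hypothetical. It accepts the drafted token with probability `min(1, p_target / q_draft)` and otherwise resamples from the normalized residual `max(0, p_target - q_draft)`.

```python
import random


def verify_one(token, p_target, q_draft, rng):
    """Accept or replace one drafted token so the emitted token follows p_target."""
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token, True
    # Rejected: resample from the part of the target that the draft under-covers.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    replacement = rng.choices(range(len(residual)), weights=residual)[0]
    return replacement, False


def emitted_frequencies(p_target, q_draft, trials, seed=0):
    """Empirical distribution of emitted tokens when drafts are sampled from q_draft."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(trials):
        drafted = rng.choices(range(len(q_draft)), weights=q_draft)[0]
        token, _ = verify_one(drafted, p_target, q_draft, rng)
        counts[token] += 1
    return [c / trials for c in counts]


print(emitted_frequencies([0.7, 0.2, 0.1], [0.4, 0.4, 0.2], trials=20000))
```

Even though the drafter here is badly miscalibrated, the emitted frequencies should land close to the target distribution, which is the sense in which the mechanism preserves the target.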

General Framework

Speculative decoding separates generation into proposing and verifying. A cheap draft system proposes several possible next tokens. The target model verifies those proposals in parallel and accepts the longest prefix that preserves its original output distribution.

Suppose we have a prompt:

The drafter turns it into a candidate tree. This can be an n-gram lookup, a small LM, an EAGLE-style head, or any other cheap proposal mechanism. Accuracy helps, but the drafter does not have to be perfect; rejected branches are simply thrown away.

The verifier scores the whole tree with one target-model forward pass. The accepted path stays; every suffix after the first rejected token disappears.

Here the accepted path is «Paris, a» — three tokens from one verification step. At this abstraction level, draft systems differ only in how quickly they propose candidates and how often those candidates survive verification.
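The verify step can be sketched for the simplest case: greedy decoding and a single draft chain instead of a tree. `toy_target_next` is a hypothetical stand-in for the target model, and the token list is made up for illustration. A real verifier would score all positions in one forward pass; the loop below only mimics the acceptance logic.

```python
TARGET_TEXT = ["The", "capital", "of", "France", "is", "Paris", ",", "a", "city"]


def toy_target_next(context):
    """Hypothetical stand-in for the target model's greedy next token."""
    return TARGET_TEXT[len(context)] if len(context) < len(TARGET_TEXT) else "<eos>"


def verify_greedy(context, draft, target_next):
    """Keep the longest draft prefix the target agrees with, plus one target token."""
    accepted = []
    for token in draft:
        expected = target_next(context + accepted)
        if token != expected:
            return accepted + [expected]  # first mismatch: emit the correction, drop the rest
        accepted.append(token)
    return accepted + [target_next(context + accepted)]  # full accept: bonus token


prompt = ["The", "capital", "of", "France", "is"]
print(verify_greedy(prompt, ["Paris", ",", "an"], toy_target_next))
```

Note that every verification step emits at least one token, because the target's own prediction replaces the first rejected draft token.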

Speedup Across Different Load Conditions

Under low load, this accepted length can turn almost directly into latency speedup, because the verifier has spare parallelism. The target pass can score several proposed positions while taking roughly the wall-clock time of one normal decode step.

At saturation, the sign depends on the bottleneck. If target decode is compute-bound, speculative verification competes with real request tokens. A real request position deterministically advances one user sequence; a drafted position only helps if it is accepted. In that regime, speculation can trade guaranteed useful work for probabilistic useful work, while also paying drafter overhead, so throughput can go down.

If target decode is memory-bound, the picture is different. A normal decode step streams the target weights and produces one token per request. A speculative verification pass can stream those same weights once and score multiple positions per request. Then the extra verified positions are not as expensive as separate decode steps, and accepted length can still improve throughput even near saturation. Current frontier models also often use MoE layers, which makes compute saturation even harder: each token activates only a slice of the model, while expert routing can still add memory traffic.

When Speedup Is Largest

Speculative decoding is strongest when two things are true: the drafter can guess well, and the verifier has idle capacity to check extra positions.

The first condition is about entropy. Structured text, code, JSON, math, templates, and other predictable outputs give the drafter fewer plausible next tokens to miss. Sampling settings matter too. Lower temperature, smaller top-k, and tighter top-p shrink the candidate set, so draft tokens are more likely to survive verification.

The second condition is about hardware headroom and decode share. Speculative decoding accelerates decode, so it matters most when decode is a large fraction of end-to-end latency. Long generations usually satisfy that because prefill and request overhead are amortized over many generated tokens. Low batch size helps because it often means spare FLOPs are available. If the GPU is already saturated, speculative work competes with real requests; if batch size is small, verification can often check several drafted tokens for roughly the same wall-clock time as one.

Quantization can create even more headroom. At small batch sizes, decode is often memory-bound, so reducing weight bandwidth with FP8 makes each decode step faster while leaving more idle tensor-core capacity. On H100 SXM, dense Tensor Core throughput is roughly 989 TFLOPS for BF16/FP16 and 1,979 TFLOPS for FP8; the larger NVIDIA headline numbers include sparsity. That makes speculative verification easier to fit into otherwise idle compute.

Average And Tail Latency

Speculative decoding can improve average TPOT while worsening P99. The good case retires multiple accepted tokens in one verification pass. The bad case pays draft overhead, verifies the draft, accepts nothing, and still advances by only one corrected target token.

Draft Systems

This section is a chronological map of draft systems I would actually consider when building a speculative decoding stack. I recommend reading it oldest to newest. Later systems usually inherit assumptions, fixes, and failure modes from earlier ones. Each entry is a compact technical note with practical serving and training details, not a full paper summary.

N-Gram Drafting — Local Reuse Without Training

N-gram drafting proposes tokens by local suffix matching. Find a previous occurrence of the current suffix and copy the continuation that followed it. It needs no neural drafter and no training. This is useful for repeated code, templates, structured text, and copy-heavy RAG, but it is only lexical reuse: it copies exact spans from earlier tokens and cannot reason over distant context or infer semantic continuations. Its acceptance rate is usually low, so meaningful speedup requires drafting many tokens or building a large tree; that only makes sense when you have a lot of idle verification FLOPs. Its main use is operational. If you have spare FLOPs and no time to train a real drafter yet, n-gram drafting is a zero-setup stopgap you can run while preparing the actual solution. Since training lightweight drafters is now relatively cheap, n-gram is mostly a temporary fallback, not the target system. See SGLang.
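A minimal sketch of suffix-match drafting, assuming a plain list of tokens and a linear scan. The function name and parameters are hypothetical, and this is not how SGLang or any other serving stack implements it; production versions use indexed lookups instead of rescanning the context.

```python
def ngram_draft(tokens, max_ngram=3, num_draft=4):
    """Propose a continuation by copying what followed the most recent suffix match."""
    # Prefer the longest suffix that has appeared earlier in the context.
    for n in range(min(max_ngram, len(tokens) - 1), 0, -1):
        suffix = tokens[-n:]
        # Scan candidate start positions from most recent to oldest.
        for start in range(len(tokens) - n - 1, -1, -1):
            if tokens[start:start + n] == suffix:
                return tokens[start + n:start + n + num_draft]
    return []  # no earlier occurrence: nothing to draft


print(ngram_draft("a b c d a b".split()))  # the suffix "a b" was followed by "c d a b"
print(ngram_draft("x y z".split()))        # no repetition, so no draft
```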

Vanilla Drafting — Separate Small Language Model

Vanilla drafting is the first real neural drafter after n-gram matching. Instead of copying a local continuation, it runs a smaller LM that can use the full context and model semantics.

Vanilla drafting (Fast Inference from Transformers via Speculative Decoding, Nov 30, 2022) usually uses a drafter from the same model family as the target, so it agrees on easy tokens often enough to be useful while being much cheaper to run.

DeepMind (Accelerating Large Language Model Decoding with Speculative Sampling, Feb 2, 2023) keeps the same idea but makes the drafter more deployment-aware. The drafter is much shallower and relatively wide for its parameter count, so it runs efficiently on the same hardware and pays less communication cost.

The advantage over n-gram drafting is a better tradeoff between accepted length and verification capacity. N-gram proposals are almost free, but they are weak. They only exploit local repetition and therefore often consume verification slots on candidates that the target rejects. A small neural drafter is more expensive, but it uses the full context and model semantics, so its candidates are accepted more often. In practice, drafting is usually much cheaper than target verification, while verification capacity is limited. Therefore, the better metric is accepted tokens per verification pass, and neural drafting usually dominates n-gram drafting under that constraint.

Medusa — Target Attached Future Token Heads

Medusa (Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads, Jan 19, 2024) is speculative decoding without a separate draft model. It attaches multiple heads to the last hidden state of the target model. The original LM head predicts $t+1$, while Medusa head $k$ predicts $t+k+1$.

A Medusa head is a small residual head applied to the same hidden state. It transforms $h_t$, adds the original $h_t$ back in, then projects to a vocabulary distribution for one future position.

More formally:

$p_t^{(k)} = \operatorname{softmax}(W_2^{(k)} \cdot (\operatorname{SiLU}(W_1^{(k)} \cdot h_t) + h_t))$
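The formula translates almost line for line into code. This is a dependency-free sketch with tiny made-up dimensions, where `W1` is a hidden-by-hidden matrix and `W2` is a vocab-by-hidden matrix stored as nested lists; a real head would be a couple of tensor ops in a deep-learning framework.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def medusa_head(h, W1, W2):
    """p = softmax(W2 · (SiLU(W1 · h) + h)) for one future position."""
    residual = [silu(a) + b for a, b in zip(matvec(W1, h), h)]
    return softmax(matvec(W2, residual))


# Toy sizes: hidden dimension 2, vocabulary size 3.
h = [0.5, -1.0]
W1 = [[0.1, 0.0], [0.0, 0.1]]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(medusa_head(h, W1, W2))
```

One property falls straight out of the residual form: if `W1` is all zeros, the SiLU branch vanishes and the head reduces to `softmax(W2 · h)`.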

You can then build a tree of possible continuations from the top predictions of those heads. For example, take the 2 most probable tokens from the first head, then pair each of them with the 3 most probable tokens from the second head, and so on, to build a tree of possibilities. The verifier scores that tree and accepts the longest valid prefix. Naively this suggests taking a Cartesian product of top predictions, but that is wasteful because not all nodes are equally useful. Given a fixed node budget, you want to spend tree nodes on the branches most likely to increase accepted length. Medusa estimates this from a calibration dataset by measuring $a_k^{(i)}$, the probability that the $i$-th top prediction of head $k$ is correct, and uses those estimates to construct the tree.
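One way to turn those calibration estimates into a tree is a best-first search, sketched below. It assumes head accuracies are independent, so the chance that a whole path is correct is the product of its per-head accuracies; `acc[k][i]` plays the role of $a_k^{(i)}$ with zero-based indices. This is an illustration of the idea, not the paper's exact procedure.

```python
import heapq


def build_tree(acc, budget):
    """Pick `budget` nodes with the highest estimated path accuracy.

    acc[k][i] is the estimated probability that the i-th top prediction of
    head k is correct. A node is a tuple of ranks, one per head along the path.
    """
    heap = [(-a, (i,)) for i, a in enumerate(acc[0])]
    heapq.heapify(heap)
    tree = []
    while heap and len(tree) < budget:
        neg_score, path = heapq.heappop(heap)
        tree.append((path, -neg_score))
        depth = len(path)
        if depth < len(acc):
            # Children only become candidates once their parent is in the tree.
            for i, a in enumerate(acc[depth]):
                heapq.heappush(heap, (neg_score * a, path + (i,)))
    return tree


acc = [[0.6, 0.2], [0.5, 0.1]]  # made-up calibration numbers
for path, score in build_tree(acc, budget=3):
    print(path, round(score, 3))
```

With this budget the search goes deeper on the strong first-head prediction before it spends a node on the weak sibling, which is exactly what a Cartesian product would not do.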

Training also matters. Medusa-1 freezes the backbone and trains only the extra heads. Medusa-2 jointly trains the backbone and the Medusa heads with a combined LM loss plus head losses.

Because these heads read target-model hidden states, the training data needs to cover the hidden-state manifold you expect at serving time; broader and more recent production data gives the heads more of that space to learn from. It should also match the target model and sampling settings used in production, otherwise the heads learn the wrong output distribution and acceptance drops.

At the beginning, the new Medusa heads have large loss; updating the backbone immediately can corrupt the model's existing knowledge. So Medusa-2 first freezes the backbone and trains only the heads, which the paper calls heads warmup. After warmup, joint training uses different learning rates: the base model is already well trained and should move slowly, while the new heads need a larger learning rate. This lets hidden states become more useful for future-token prediction without destroying next-token quality.

If the original pretraining data is unavailable, the paper also discusses dataset regeneration. Ask the model to generate completions from prompts and train the heads on those completions.

This is also why joint pretraining matters. If a model was pretrained with MTP-style future-token heads, its hidden states may already be shaped to expose information useful for predicting more than the next token. That can make later post-hoc speculative drafters easier to train. You are not starting from representations that were optimized only for next-token prediction.

Medusa also introduces typical acceptance. Instead of exact rejection sampling, you accept a candidate token if the target model assigns it probability above a threshold; that threshold is dynamic and depends on entropy. If the target is confident (low entropy), the threshold is stricter. If the target is uncertain (high entropy), the threshold is looser. This improves acceptance and speed, but it is not lossless. If you want exact distribution matching, you still need rejection sampling.

More formally, for a candidate token $x_{n+k}$ in a proposed sequence, Medusa accepts it when:

$p_{\text{original}}(x_{n+k}\mid x_1, x_2, \dots, x_{n+k-1}) > \min\left(\epsilon, \delta \exp\left(-H\left(p_{\text{original}}(\,\cdot\,\mid x_1, x_2, \dots, x_{n+k-1})\right)\right)\right)$

Here $H(\cdot)$ is the entropy function, while $\epsilon$ and $\delta$ are the hard threshold and entropy-dependent threshold respectively.
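The acceptance rule is small enough to write out directly. In this sketch the `epsilon` and `delta` values are illustrative placeholders chosen for the example, not recommended settings.

```python
import math


def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)


def typical_accept(p_target, token, epsilon, delta):
    """Accept `token` if its target probability clears the entropy-dependent threshold."""
    threshold = min(epsilon, delta * math.exp(-entropy(p_target)))
    return p_target[token] > threshold


confident = [0.97, 0.01, 0.01, 0.01]  # low entropy
uncertain = [0.01] * 100              # high entropy (uniform over 100 tokens)

print(typical_accept(confident, 1, epsilon=0.09, delta=0.3))  # 1% token, confident target
print(typical_accept(uncertain, 1, epsilon=0.09, delta=0.3))  # 1% token, uncertain target
```

The same 1% token is rejected when the target is confident and accepted when the target is spread out, because the threshold shrinks as entropy grows.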

EAGLE 1 — Hidden Feature Drafter With Frozen Target

EAGLE-1 (EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty, Jan 26, 2024) uses a small draft model. It is a Transformer block trained autoregressively on top of the frozen target model's second-to-top-layer features and token embeddings. It reuses the target model's embedding layer and LM head, so the trainable part is mostly the lightweight block that learns to continue the target model's representation trajectory.

The subtle trick is that EAGLE does not predict future features from previous features alone. Future features are ambiguous unless the drafter also knows which token was sampled. The same current feature can lead to different next features depending on the token branch. So EAGLE feeds the draft token shifted one step forward together with the feature sequence. The token chooses the branch; the EAGLE block predicts the target-model feature on that branch.

The objective is the important shift. EAGLE trains the drafter through future hidden features, not only through next-token labels. Token space is high entropy and brittle; hidden features are smoother, so a small model can predict their trajectory more easily. The loss is weighted. A Smooth L1 regression loss pulls the predicted feature toward the target feature, while a cross-entropy term checks that the target LM head turns that predicted feature into the right token distribution. EAGLE also adds noise to hidden states during training, which is the same robustness intuition as NEFTune: Noisy Embeddings Improve Instruction Finetuning (Oct 9, 2023). The idea is to perturb the representation space a bit so the trained module does not become too brittle around exact training activations.

EAGLE is also the first system in this list that trains and evaluates a learned drafter on an MoE target, Mixtral 8x7B Instruct. That matters because MoE verification is less «verify $K$ tokens for the cost of 1» than dense verification. Different draft tokens can route to different experts, so a wider verification batch may activate more expert weights and increase memory traffic. The attention layers still amortize well, but the expert layers make the marginal cost of extra verified tokens less free.

Meta's MTP — Jointly Pretrained Future Token Heads

Meta's MTP (Better & Faster Large Language Models via Multi-token Prediction, Apr 30, 2024) was introduced primarily as a better pretraining objective. It asks the model to predict several future tokens, not only the next one. That can improve sample efficiency and downstream quality. For speculative decoding, the useful side effect is that those future-token predictors are already built into the model and can act as draft heads.

Architecturally, MTP is very close to Medusa. Several position-specific heads read from the same shared representation, and each head predicts one future token with a cross-entropy loss. When those heads are used for self-speculative decoding, the draft step is parallel rather than EAGLE-style recursive drafting. The serving path runs the shared trunk once, runs the future-token heads from that representation, then verifies the proposed tokens. In ordinary next-token inference, those extra heads can simply be ignored.

The difference is what those heads are and when they are trained. Medusa attaches small residual FFN heads after the target model is already trained, with each head using its own vocab projection. Meta's MTP uses independent Transformer-layer heads with a shared unembedding matrix, and those heads are trained jointly with the shared trunk during pretraining. In other words, the path for offset $k$ is: shared trunk, then the MTP head for offset $k$, then the shared unembedding, which produces token $t+k$.

The practical limitation is horizon. MTP gives you a cheap built-in drafter, but it is still a short-horizon system. If you are far below saturation, MTP can leave verification capacity unused compared to longer-horizon systems such as DFlash or a deeper EAGLE tree.

EAGLE 2 — Dynamic Trees for Verification Budget

EAGLE-1 gives you a drafter. EAGLE-2 (EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees, Jun 24, 2024) asks the next serving question. Given this drafter, how should we spend a fixed verification tree budget?

Static trees waste budget because they spend nodes the same way regardless of the prompt. Some contexts are uncertain and need branching near the root; other contexts have an obvious next token and should mostly go deeper. In the figure above, "10+2=" should spend its nodes on the likely continuation "1" and then draft "2", rather than burning a node on the unlikely sibling "3".

EAGLE-2's thesis is simple. Drafter logits correlate with acceptance probability, so spend verification nodes where confidence is highest. The drafter is mostly unchanged; the improvement comes from building the tree dynamically, going deeper on confident branches and pruning weak siblings before they consume batch budget.
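A simplified sketch of confidence-driven tree growth follows. It is a plain best-first expansion by cumulative drafter confidence, which captures the idea but is not EAGLE-2's exact expand-and-rerank procedure. `TOY_DRAFTER` is a hypothetical lookup table standing in for a real drafter's top-k output, with made-up probabilities.

```python
import heapq

# Hypothetical drafter output for the prompt "10+2=": path -> [(token, prob), ...]
TOY_DRAFTER = {
    (): [("1", 0.90), ("3", 0.05)],
    ("1",): [("2", 0.95), ("3", 0.03)],
    ("3",): [("0", 0.50)],
}


def toy_topk(path):
    return TOY_DRAFTER.get(path, [])


def dynamic_tree(draft_topk, budget):
    """Grow the draft tree where cumulative drafter confidence is highest."""
    heap = [(-p, (tok,)) for tok, p in draft_topk(())]
    heapq.heapify(heap)
    nodes = []
    while heap and len(nodes) < budget:
        neg_conf, path = heapq.heappop(heap)
        nodes.append((path, -neg_conf))
        for tok, p in draft_topk(path):
            # Cumulative confidence is the product of probabilities along the path.
            heapq.heappush(heap, (neg_conf * p, path + (tok,)))
    return nodes


for path, conf in dynamic_tree(toy_topk, budget=2):
    print(path, round(conf, 3))
```

With a two-node budget the tree goes "1" then "1", "2" and never pays for the unlikely root sibling.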

The practical caveat is that draft trees are only useful when you really have free FLOPs to spend. Every extra verification node adds less to expected accepted length than the previous one, because lower-probability branches and later positions have diminishing acceptance probability. Large trees also inflate the effective verification batch. A 30-node tree on a service handling 16 concurrent requests behaves like 30 × 16 = 480 verification rows, which can push you out of the memory-bound regime where speculative decoding helps. This is why large-tree results at BS=1 do not automatically transfer to production. At BS=1 you may have a whole GPU sitting underused, so even weak branches can be worth trying; in production, where verification nodes compete with real user requests for batch capacity, shallow dynamic trees are usually the sensible limit.

DeepSeek's Recursive MTP — Sequential Future Token Modules

DeepSeek's recursive MTP (DeepSeek-V3 Technical Report, Dec 27, 2024) takes Meta-style MTP and makes it sequential. Instead of predicting several future offsets independently from the same hidden state, the design predicts additional tokens through a chain of MTP modules. Later depths depend on representations produced by earlier depths, so deeper MTP would be closer to an autoregressive future trajectory.

Architecturally, each MTP depth has its own module. It consists of a shared embedding layer, a projection matrix, a Transformer block, and the shared output head. At depth $k$, the module combines the previous-depth hidden representation with the embedding of the $k$-th future token, projects them together, and runs a Transformer block. The important difference from Meta's MTP is that DeepSeek does not predict all future tokens as independent heads from the same hidden state; it keeps a causal chain across MTP depths.

The objective is still future-token cross-entropy, added as an auxiliary pretraining loss across MTP depths. So it is closer to Meta's MTP than to EAGLE. It predicts future tokens, not hidden features. DeepSeek-V3 sets the MTP depth to 1, so the released V3 setting predicts one additional token; the recursive chain matters when the design is extended to deeper MTP.

In deployment, the MTP module can become a built-in drafter. With depth 1 it is still a short-horizon system. If you are very far below saturation, it may still underuse available verification budget compared with DFlash or a deeper EAGLE tree.

FR Spec — Vocabulary Compression for Drafting

FR-Spec (FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling, Feb 20, 2025) is a systems optimization for speculative decoding, not a new drafter architecture. Its starting point is popularity bias. Tokens do not occur uniformly. In the paper's SlimPajama sample with the Llama-3-8B tokenizer, 75% of vocabulary tokens account for less than 5% of token occurrences. That long tail makes the full vocabulary expensive to support inside a tiny drafter.

For deep target models, the LM head is usually a rounding error in per-token cost (little $o$): dozens of Transformer layers dominate. For tiny drafters, especially one-layer systems like EAGLE, that flips. The drafter has almost no backbone, so projecting into a 128k-token vocabulary and running softmax can become the dominant cost (big $O$).

The solution is to train or run the EAGLE-style drafter against only popular tokens. You lose a small amount of accepted length because the drafter cannot propose rare tokens, but rare tokens are exactly where it was unlikely to be reliable anyway. In exchange, the draft step gets materially faster. If the LM head is 40% of draft latency and FR-Spec makes it 3x faster, Amdahl's law gives:

$S = \frac{1}{0.6 + \frac{0.4}{3}} \approx 1.36\times$

In practice, choose the draft vocabulary from production data. A reasonable starting point is the smallest frequency-ranked vocabulary that covers the desired cumulative token-occurrence mass, then validate accepted length on production traces. This is aggressive enough to shrink the drafter's embedding and LM head, but not so aggressive that you lose common tokens the drafter could have proposed. FR-Spec is especially strong in narrow domains. If your workload mostly uses a tiny vocabulary slice, you can cut the draft vocabulary far below the generic frequent-token set and get more extreme speedups. The sweet spot is domain-specific traffic, lightweight drafter, large target vocabulary, and enough repetition that rare-token coverage is not worth paying for on every draft step.
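The "smallest frequency-ranked vocabulary that covers the desired mass" rule can be sketched in a few lines. The function name and the toy trace are hypothetical; in practice `token_ids` would come from tokenized production traffic.

```python
from collections import Counter


def select_draft_vocab(token_ids, coverage):
    """Smallest frequency-ranked token set covering `coverage` of all occurrences."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    vocab, covered = [], 0
    for token, count in counts.most_common():
        if total == 0 or covered / total >= coverage:
            break
        vocab.append(token)
        covered += count
    return vocab


# Toy trace: token 1 dominates, tokens 3 and 4 are the long tail.
trace = [1] * 90 + [2] * 6 + [3] * 3 + [4] * 1
print(select_draft_vocab(trace, coverage=0.95))
```

The coverage target is the knob to sweep: the selected set is only a starting point, and accepted length still has to be validated on real traces.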

EAGLE 3 — Direct Token Drafter With Multi Layer Features

EAGLE-3 (EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test, Mar 3, 2025) is the "scale the drafter" version of EAGLE. EAGLE-1 shows that a small draft model can follow the target model's hidden-state dynamics. EAGLE-2 shows how to spend the verification tree budget dynamically. EAGLE-3 asks why adding much more training data does not keep improving EAGLE-2, and identifies feature prediction itself as the bottleneck.

The problem is that future hidden features are a hard training target. They are high-dimensional, continuous, and not uniquely determined by the future token sequence. Predicting them is useful, but it constrains the drafter to solve a difficult feature-regression problem before it can do the thing we actually care about, which is proposing tokens that survive verification. This is why scaling training data gives limited returns in earlier EAGLE versions.

EAGLE-3 removes that bottleneck by switching to token-level supervision. Instead of forcing the draft output to match the target model's next feature, it only needs that output to produce the right token distribution through the LM head. It still uses the shifted sampled-token embedding, like EAGLE-1, but it no longer relies only on the top hidden state of the target model. The drafter concatenates target features from low, middle, and high layers and projects them with an FC layer into a fused feature, so it can read both local/token-level information and more semantic/high-level information from the target.

The "training-time test" idea is to train the drafter under conditions that look more like inference. In EAGLE-1 training, the drafter sees clean target-model features. During serving, after the first drafted step, later draft steps must condition on the drafter's own generated outputs instead. Those states are messier and off the clean target-feature manifold. EAGLE-3 modifies the training attention masks to simulate that multi-step draft process during training, so later positions learn to consume draft-produced states rather than only teacher-forced target states. That reduces train-test mismatch and helps the draft model benefit more from larger training sets.

Acceptance-Rate Target

Target at least ~0.55 per-token acceptance rate. If you're significantly below that, something is likely wrong: a bug in data generation, a mismatch in sampling parameters between training and inference, or noisy training data. For reference, EAGLE-3's reported acceptance-rate plots reach roughly 0.75-0.8 in its measured settings, so 0.55 is a conservative floor, not a ceiling. If you draft 3 tokens and each position is correct with probability 0.55 conditional on all previous positions being correct, then the expected accepted length is $1 + 0.55 + 0.55^2 + 0.55^3 \approx 2.02$ tokens per verification step. Equivalently, the expected number of accepted draft tokens is $0.55 + 0.55^2 + 0.55^3 \approx 1.02$, so the average fraction of drafted tokens that survives is $\frac{0.55 + 0.55^2 + 0.55^3}{3} \approx 34\%$.
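The arithmetic above generalizes to any acceptance rate and draft length. A small helper, assuming the same simplification that every position has the same conditional acceptance probability:

```python
def expected_tokens_per_step(alpha, draft_len):
    """1 + alpha + alpha^2 + ... + alpha^draft_len (the leading 1 is the target's own token)."""
    return sum(alpha ** i for i in range(draft_len + 1))


def surviving_fraction(alpha, draft_len):
    """Average share of drafted tokens that are accepted."""
    return (expected_tokens_per_step(alpha, draft_len) - 1) / draft_len


for alpha in (0.55, 0.8):
    print(alpha,
          round(expected_tokens_per_step(alpha, 3), 2),
          round(surviving_fraction(alpha, 3), 2))
```

Real acceptance usually decays with position, so this constant-rate model is an optimistic estimate for long drafts.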

ATLAS — Adaptive Online Speculator Selection

ATLAS (AdapTive-LeArning Speculator System, published Oct 10, 2025) is a production system built around the idea that a static speculator eventually goes stale. Together's design combines a heavyweight static speculator, a lightweight adaptive speculator, and a confidence-aware controller that decides both which speculator to trust and how much lookahead to use at each step. The important systems idea is that speculative decoding should adapt online as traffic shifts, rather than treating speculator training as a one-time offline artifact.

Retraining Cadence

Retraining frequency is a bias-variance tradeoff. If you train the drafter aggressively on only the most recent traffic, you maximize peak acceptance rate right after deployment, but the gain is fragile. As the serving distribution drifts, acceptance falls quickly and you need frequent retrains. At the other extreme, training on a broader and more diverse distribution acts like regularization. Peak acceptance is lower, but the drafter is more robust to domain shift and degrades much more slowly over time. In practice the best recipe is usually a mixture. Keep a broad base dataset so the drafter stays stable, then blend in a recent slice to recover domain-specific peak performance.

DFlash — Block Parallel Long Horizon Drafting

DFlash (DFlash: Block Diffusion for Flash Speculative Decoding, Feb 5, 2026) is the long-horizon answer to speculative decoding. EAGLE, Medusa, and MTP improve the drafter, but they still live close to the autoregressive world. DFlash asks whether drafting itself can be made block-parallel.

The problem is that autoregressive drafting has a horizon tax. Even if EAGLE-3 is strong, drafting many tokens still means running the drafter sequentially, one step per drafted position. When batch size is very far below saturation, that sequential draft path can leave parallelism unused. This is exactly the regime where a system that can draft a whole block at once becomes interesting.

DFlash uses a lightweight block diffusion drafter. Instead of predicting one token after another, it fills a block of future tokens in parallel. It conditions the diffusion drafter on target-model context features, including features extracted from multiple target layers, so the draft model is not trying to be a standalone LLM. It is closer to a context-conditioned draft adapter whose job is only to propose candidates that the target model can verify.

Training follows the same serving intuition. The diffusion drafter learns to denoise masked token blocks while conditioning on frozen target-model features. Each block begins from a known anchor/context token and then learns to fill the following masked positions. Early positions matter most, because one early rejection discards every later position in the draft, so weighting earlier draft positions more heavily is a natural fit.

Practically, DFlash is attractive when you are far below saturation and want long drafts. If you only have capacity for one to three draft tokens, it is probably overkill; EAGLE or MTP is simpler. But if you can verify 15+ tokens per request, DFlash is exactly the kind of system that can use that parallelism instead of leaving GPU capacity idle.

Aurora — Asynchronous Online Speculator Training

Aurora (When RL Meets Adaptive Speculative Training: A Unified Training-Serving System, submitted Feb 6, 2026) pushes the same direction further by framing online speculator learning as an asynchronous reinforcement-learning problem. Instead of training a speculator offline and hoping it transfers, Aurora continuously updates the speculator from live inference traces, using accepted tokens as positive feedback and rejected proposals as implicit negative feedback. The result is a unified training-serving loop where the speculator can be deployed immediately, improved without downtime, and kept aligned to drifting workloads.

LK Losses — Train for Acceptance and Not KL

LK losses (LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding, Feb 27, 2026) are not a new draft architecture. They are a better training objective for draft models. KL is a proxy for acceptance rate. If the drafter has enough capacity, the global optimum of the proxy matches the optimum of the target metric, making the draft distribution match the target distribution. But draft models are small, so they can get stuck in local optima where low KL is not the same as high acceptance. FR-Spec-style vocabulary truncation makes this even more indirect. You optimize KL against a truncated and renormalized target distribution, so the training signal becomes a proxy for KL, which is itself a proxy for acceptance.

The direct metric is distribution overlap. For one draft position, expected acceptance is $\alpha = \sum_x \min(p(x), q(x))$, and this is exactly $1 - TV(p, q)$.

Proof

Let $p$ be the target distribution and $q$ the draft distribution. For a drafted token $x \sim q$, speculative sampling accepts with probability $a(x)=\min(1,p(x)/q(x))$, so the accepted proposal mass is

$q(x)a(x)=q(x)\min\left(1,\frac{p(x)}{q(x)}\right)=\min(p(x),q(x))$

Therefore the expected one-position acceptance is

$\alpha=\sum_x \min(p(x),q(x))$

Now use the identity

$\min(p(x),q(x))=\frac{p(x)+q(x)-|p(x)-q(x)|}{2}$

It holds in both cases. If $p(x)\ge q(x)$, the right side becomes $q(x)$. If $q(x)\ge p(x)$, it becomes $p(x)$.

Substitute it.

$\alpha=\sum_x \frac{p(x)+q(x)-|p(x)-q(x)|}{2}$

$\alpha=\frac{1}{2}\sum_x p(x)+\frac{1}{2}\sum_x q(x)-\frac{1}{2}\sum_x |p(x)-q(x)|$

Since $p$ and $q$ are probability distributions, $\sum_x p(x)=1$ and $\sum_x q(x)=1$, so

$\alpha=1-\frac{1}{2}\sum_x |p(x)-q(x)|$

By definition, $TV(p,q)=\frac{1}{2}\sum_x |p(x)-q(x)|$, so $\alpha=1-TV(p,q)$.

So maximizing acceptance is equivalent to minimizing total variation distance. TV is not just another divergence here; it is the divergence tied directly to speculative decoding speed.
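The identity is easy to sanity-check numerically. The two distributions below are made up for the example.

```python
def acceptance(p, q):
    """Expected one-position acceptance: the overlap of the two distributions."""
    return sum(min(a, b) for a, b in zip(p, q))


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(acceptance(p, q), 1 - total_variation(p, q))  # both equal 0.8 up to float rounding
```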

The three losses prefer different draft shapes. KL tries to fit all target modes, including low-probability regions. Reverse KL is mode-seeking. It can fit only the densest part and ignore the rest. TV is closer to the speculative-decoding objective because it mostly cares about covering the probability mass that will actually overlap with the target.

Pure TV is theoretically right but practically annoying. Its gradient mostly tells you whether the draft probability for a token is above or below the target probability, not how wrong it is. With a randomly initialized drafter spread over a huge vocabulary, that signal can also be tiny. So TV starts from the correct objective, but optimizing it from scratch can be unstable and slow.

LK combines the stable proxy with the true objective. The hybrid version is $L_{LK}^{\lambda} = \lambda\, KL(q \,\|\, p) + (1 - \lambda)\, TV(p, q)$, where the KL term is the forward KL from the target $q$ to the draft $p$, and $\lambda$ is controlled by current acceptance. When acceptance is low, the loss is mostly KL because KL gives smooth, useful gradients. As acceptance improves, the loss shifts toward TV, because the drafter is close enough that direct acceptance optimization becomes useful.
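A minimal sketch of the hybrid loss for a single position, with $p$ as the draft and $q$ as the target. The KL term is computed in the mass-covering direction (target as reference). The schedule `lam = 1 - acceptance` is an assumption made only for illustration; the text says $\lambda$ depends on current acceptance but does not give the exact rule. The function names are also illustrative.

```python
import numpy as np

def kl_target_to_draft(q, p, eps: float = 1e-12) -> float:
    """Forward KL with the target q as reference: sum_x q(x) log(q(x)/p(x))."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

def tv(p, q) -> float:
    """Total variation distance between draft p and target q."""
    return float(0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

def lk_hybrid_loss(p, q, acceptance: float):
    """Return (loss, lam) for one position.

    ASSUMPTION: lam = 1 - acceptance, clipped to [0, 1]. This is a
    hypothetical schedule, not the one from the LK losses paper.
    """
    lam = float(np.clip(1.0 - acceptance, 0.0, 1.0))
    loss = lam * kl_target_to_draft(q, p) + (1.0 - lam) * tv(p, q)
    return loss, lam

p = np.array([0.6, 0.3, 0.1])  # toy draft
q = np.array([0.3, 0.3, 0.4])  # toy target
for acc in (0.1, 0.5, 0.9):
    loss, lam = lk_hybrid_loss(p, q, acc)
    print(f"acceptance={acc:.1f} lambda={lam:.1f} loss={loss:.4f}")
```

In a real training loop the same combination would be written in an autodiff framework and averaged over positions; the NumPy version only shows how the weighting moves from KL toward TV as acceptance rises.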

Speculative Speculative Decoding — Draft Latency Hiding

Speculative Speculative Decoding, or SSD (Speculative Speculative Decoding, Mar 3, 2026), is not another short-horizon drafter architecture. It overlaps drafting with verification by precomputing continuations for likely verification outcomes. In normal speculative decoding, the next draft waits for the verifier to tell you what prefix survived. SSD hides that drafting latency under verification. While the target model verifies the current draft, a separate draft worker prepares possible next drafts ahead of time.

The trick is that the drafter does not know the exact next prefix yet. A verification outcome is not only how many draft tokens were accepted; it also includes the bonus token sampled by the verifier after the accepted prefix. SSD caches next drafts for likely pairs of accepted length and bonus token. When verification finishes, if the real outcome is in the cache, the next draft is ready immediately. If not, the system falls back to normal drafting.
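The cache lookup can be sketched as a dictionary keyed by the pair (accepted length, bonus token). Everything here is a hypothetical illustration: `SpeculationCache`, `toy_drafter`, and the way likely outcomes are supplied are assumptions, not the SSD implementation, and a real system would run `precompute` on a separate draft worker while the verifier is busy.

```python
from typing import Callable, Dict, List, Tuple

Outcome = Tuple[int, int]  # (accepted_length, bonus_token)

class SpeculationCache:
    """Hypothetical cache of next drafts, keyed by verification outcome."""

    def __init__(self, draft_fn: Callable[[List[int]], List[int]]):
        self.draft_fn = draft_fn
        self.cache: Dict[Outcome, List[int]] = {}
        self.hits = 0
        self.misses = 0

    def precompute(self, context: List[int], draft: List[int],
                   likely_outcomes: List[Outcome]) -> None:
        """Meant to run during verification: draft ahead for each guessed outcome."""
        self.cache.clear()
        for accepted_len, bonus in likely_outcomes:
            prefix = context + draft[:accepted_len] + [bonus]
            self.cache[(accepted_len, bonus)] = self.draft_fn(prefix)

    def next_draft(self, context: List[int], draft: List[int],
                   outcome: Outcome) -> List[int]:
        """After verification: use the cached draft, or fall back to normal drafting."""
        if outcome in self.cache:
            self.hits += 1
            return self.cache[outcome]
        self.misses += 1
        accepted_len, bonus = outcome
        return self.draft_fn(context + draft[:accepted_len] + [bonus])

def toy_drafter(prefix: List[int]) -> List[int]:
    """Stand-in drafter: proposes the next two integers after the last token."""
    return [prefix[-1] + 1, prefix[-1] + 2]

cache = SpeculationCache(toy_drafter)
context, draft = [1, 2], [3, 4, 5]
cache.precompute(context, draft, likely_outcomes=[(3, 6), (1, 9)])
print(cache.next_draft(context, draft, (3, 6)))   # cache hit
print(cache.next_draft(context, draft, (0, 42)))  # cache miss, falls back
```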

This points to where production improvements may come from after short-horizon drafters enter diminishing returns. Better acceptance still matters, but the next layer is infrastructure. Separate draft compute, asynchronous draft and verify stages, speculation caches, and better schedulers can make the drafter and verifier behave like one pipeline instead of two sequential steps.

RL Rollout Speculation — System Integrated Drafting

System-integrated speculative decoding for RL rollouts (Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding, Apr 29, 2026) is not a new drafter either. It asks where speculative decoding belongs inside RL post-training. Rollout generation is often the largest part of an RL step, especially for reasoning models with long outputs. If you accelerate rollouts losslessly, you speed up training without changing the policy distribution that the RL objective is supposed to optimize.

Rollout Stragglers

There is also a useful scheduler view here. Early in a large rollout batch, throughput is already high and speculative verification competes with real batch tokens. Near the tail, many sequences have finished, effective batch size collapses, and the remaining long generations become stragglers. That is when speculation should be cheapest. A rollout scheduler could spend newly idle FLOPs on the remaining sequences and reduce tail latency without paying speculative overhead for the whole rollout.
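One way to picture such a scheduler is a policy that grants draft positions only from the slots the live batch leaves idle. The function below is a hypothetical sketch: `saturation_batch` stands for the number of token positions per step at which decode stops being memory-bound, which is hardware- and model-dependent and assumed known here.

```python
def draft_length_for_batch(active_sequences: int,
                           saturation_batch: int,
                           max_draft_len: int) -> int:
    """Hypothetical policy: draft tokens per sequence from idle verification slots.

    Returns 0 while the batch already saturates the GPU, and grows toward
    max_draft_len as sequences finish and slots become idle.
    """
    if active_sequences <= 0:
        return 0
    idle_slots = max(0, saturation_batch - active_sequences)
    return min(max_draft_len, idle_slots // active_sequences)

# Early in the rollout the batch is full, near the tail only stragglers remain.
for active in (256, 200, 100, 8):
    k = draft_length_for_batch(active, saturation_batch=256, max_draft_len=4)
    print(f"active={active:3d} -> draft length {k}")
```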

Verify Systems

The verifier problem is simple. The drafter gives a cheap proposal distribution $p(x)$, the target model defines the distribution we actually want $q(x)$, and usually $p(x) \neq q(x)$. Sampling directly from $q$ is expensive because it means running the target model autoregressively for every token. Sampling from $p$ is cheap but wrong. Verification has to turn cheap samples from $p$ into exact samples from $q$.

Monte Carlo Methods

Classical Monte Carlo gives two obvious tools. Inverse transform sampling can sample exactly from $q$, but it requires the target distribution itself, so it does not save the target-model step. Standard rejection sampling samples from $p$ and accepts with probability $\frac{q(x)}{M\,p(x)}$, where $M \geq \max_x \frac{q(x)}{p(x)}$ is a global envelope constant. For language models that constant can be huge. One token that the drafter underestimates badly inflates $M$ and forces the whole sampler to reject often, because the overall acceptance rate is $1/M$. That is mathematically correct but useless for latency.
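A small numeric illustration of the envelope problem, using made-up toy distributions in which the draft badly underestimates one token. It compares the classical acceptance rate $1/M$ with the overlap $\sum_x \min(p(x), q(x))$ that the speculative sampling rule in the next section achieves.

```python
import numpy as np

p = np.array([0.70, 0.299, 0.001])  # toy draft: badly underestimates token 2
q = np.array([0.60, 0.300, 0.100])  # toy target

# Smallest valid envelope constant: M = max_x q(x) / p(x).
M = float(np.max(q / p))

# Classical rejection sampling accepts x ~ p with probability q(x) / (M p(x)),
# so the overall acceptance rate is sum_x p(x) q(x) / (M p(x)) = 1 / M.
classical_accept = float(np.sum(p * q / (M * p)))

# Speculative sampling accepts with min(1, q/p), i.e. mass sum_x min(p, q).
speculative_accept = float(np.minimum(p, q).sum())

print(f"M = {M:.1f}")
print(f"classical acceptance   = {classical_accept:.3f}")
print(f"speculative acceptance = {speculative_accept:.3f}")
```

With these toy numbers $M$ is about 100, so the classical sampler accepts about 1% of proposals, while the speculative rule accepts about 90%.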

Speculative Sampling

Now take one node in the draft tree. The drafter proposes a token from $p$, but the verifier wants the next token to look as if it came from $q$. If $p=q$, there is nothing to do. Every draft token can be accepted. If $p(x)<q(x)$ for a token, the drafter underproduces that token, so accepting all proposed copies of $x$ is safe; we still do not have enough of it.

The only dangerous case is $p(x)>q(x)$. Then the drafter overproduces $x$. If we accept every proposed $x$, the long-run histogram will contain too many copies of that token and the output distribution will not be $q$. So we thin those proposals. We want accepted mass $q(x)$, but proposals arrive with mass $p(x)$, so the acceptance probability must solve $p(x)a(x)=q(x)$, which gives $a(x)=q(x)/p(x)$. Combined with the safe case, where $a(x)=1$, the rule for a proposed token $\tilde{x}$ is:

$a(\tilde{x}) = \min\left(1, \frac{q(\tilde{x})}{p(\tilde{x})}\right)$

That settles when a proposed token is accepted. The remaining question is what to do after rejection. Write the final one-step distribution as:

$P(X=x) = p(x)a(x) + P(\text{reject})u(x)$

We want the final output distribution to be $q$, so set $P(X=x)=q(x)$ and solve for the residual distribution:

$u(x)=\frac{q(x)-p(x)a(x)}{P(\text{reject})}$

Now compute the numerator:

$\begin{aligned} q(x)-p(x)a(x) &= q(x)-p(x)\min\left(1, \frac{q(x)}{p(x)}\right) \\ &= q(x)-\min(p(x), q(x)) \\ &= \max(0, q(x)-p(x)) \end{aligned}$

Now compute the denominator:

$\begin{aligned} P(\text{reject}) &= 1 - \sum_{x'} p(x')a(x') \\ &= 1 - \sum_{x'} \min(p(x'), q(x')) \\ &= \sum_{x'} q(x') - \sum_{x'} \min(p(x'), q(x')) \\ &= \sum_{x'} \left(q(x') - \min(p(x'), q(x'))\right) \\ &= \sum_{x'} \max(0, q(x')-p(x')) \end{aligned}$

Combining the numerator and denominator gives:

$u(x)=\frac{\max(0, q(x)-p(x))}{\sum_{x'}\max(0, q(x')-p(x'))}$

So after rejection, we sample from the normalized positive residual of $q-p$.
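The accept-or-resample rule for one position can be sketched directly from the two formulas above. The distributions are toy values and `speculative_step` is an illustrative name; a real verifier would apply this per draft position using target-model logits.

```python
import numpy as np

def speculative_step(p: np.ndarray, q: np.ndarray, rng: np.random.Generator):
    """One lossless accept/reject step. p: draft probs, q: target probs."""
    x = int(rng.choice(len(p), p=p))            # draft proposes x ~ p
    if rng.random() < min(1.0, q[x] / p[x]):    # accept with a(x) = min(1, q/p)
        return x, True
    residual = np.maximum(0.0, q - p)           # positive part of q - p
    residual = residual / residual.sum()        # normalize to get u(x)
    return int(rng.choice(len(q), p=residual)), False

p = np.array([0.6, 0.3, 0.1])  # toy draft distribution
q = np.array([0.3, 0.3, 0.4])  # toy target distribution
rng = np.random.default_rng(0)

n = 20_000
counts = np.zeros(len(q))
accepted = 0
for _ in range(n):
    token, ok = speculative_step(p, q, rng)
    counts[token] += 1
    accepted += ok

empirical_q = counts / n
acceptance_rate = accepted / n
print("empirical output distribution:", np.round(empirical_q, 3))
print("acceptance rate:", round(acceptance_rate, 3))
```

The empirical histogram should land close to $q$ rather than $p$, and the acceptance rate should be close to $\sum_x \min(p(x), q(x)) = 0.7$ for these toy values, up to sampling noise.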

Speculative decoding does not guarantee identical generations — but neither does regular sampling with non-zero temperature. What it does guarantee is that the output distribution is preserved exactly. If you fix a prompt and sample the next token thousands of times, the histogram you get with speculative decoding will match the histogram without it. Individual samples may differ on any given run, but the distribution they are drawn from is mathematically the same. This is what makes the technique sound. It changes latency without changing the model's behavior in any statistical sense.

Offline Validation

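Always validate output quality offline. Speculative decoding is lossless by construction, but the implementation has many surfaces for silent bugs. They won't crash your service; they'll quietly shift the output distribution. Run the target model with and without speculative decoding on the same prompts. Under greedy decoding the outputs should match token for token, up to numerical nondeterminism, so a plain diff is a direct check. Under sampling, individual outputs legitimately differ, so compare next-token histograms or downstream quality metrics instead; a statistically significant shift means a bug.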
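A minimal sketch of two comparison helpers, assuming you already have token-ID lists from a baseline run and a speculative run. `first_divergence` is an exact diff suited to greedy decoding; `empirical_tv` compares histograms of sampled tokens, where individual samples are expected to differ. Both names are illustrative, and any pass/fail threshold on the empirical distance has to be chosen for your sample size.

```python
from collections import Counter
from typing import Optional, Sequence

def first_divergence(baseline: Sequence[int],
                     speculative: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(baseline, speculative)):
        if a != b:
            return i
    if len(baseline) != len(speculative):
        return min(len(baseline), len(speculative))
    return None

def empirical_tv(samples_a: Sequence[int], samples_b: Sequence[int]) -> float:
    """Total variation distance between two empirical token histograms."""
    count_a, count_b = Counter(samples_a), Counter(samples_b)
    n_a, n_b = len(samples_a), len(samples_b)
    tokens = set(count_a) | set(count_b)
    return 0.5 * sum(abs(count_a[t] / n_a - count_b[t] / n_b) for t in tokens)

# Greedy check: any index returned here deserves investigation.
print(first_divergence([11, 42, 7, 3], [11, 42, 9, 3]))
# Sampling check: distance between next-token samples from the two systems.
print(empirical_tv([0, 0, 1, 2], [0, 1, 1, 2]))
```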

Judge Decoding

A separate frontier is lossy verification. Current lossless methods preserve the target distribution exactly, but the product requirement is often closer to preserving generation quality. Judge decoding follows that idea. It accepts tokens that are good enough, not only tokens that preserve the exact sampling distribution. This can improve acceptance, but it gives up the clean lossless guarantee.

Strict distribution matching is a harsh filter. The rejection sampler does not ask whether a proposed token is good. It asks whether accepting that token preserves the target distribution exactly. If the draft overproduces a reasonable token that the target model assigns lower probability to, verification must thin those proposals, even when the continuation would be perfectly acceptable to a human. This is the cost of being lossless. You preserve the exact distribution, but you throw away useful tokens to do it. Judge decoding relaxes this constraint — instead of demanding distributional equivalence, it uses a learned judge to decide whether a draft token is «good enough». This is lossy. The output distribution shifts, but if the judge is well-calibrated, the quality stays the same while acceptance rates go up.
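As a rough sketch of the lossy acceptance rule, assume some learned judge has already produced one score per draft token. The scores, the `threshold` value, and the function name are all hypothetical; the only point is that acceptance depends on a quality score rather than on the ratio $q(x)/p(x)$, and that the accepted draft is still a prefix.

```python
from typing import Sequence

def judge_accept_length(judge_scores: Sequence[float],
                        threshold: float = 0.5) -> int:
    """Number of leading draft tokens the (hypothetical) judge accepts.

    Acceptance stops at the first token scored below the threshold,
    because later draft tokens were conditioned on the rejected one.
    """
    accepted = 0
    for score in judge_scores:
        if score < threshold:
            break
        accepted += 1
    return accepted

# Toy scores for a 5-token draft: the fourth token is judged not good enough.
print(judge_accept_length([0.97, 0.88, 0.71, 0.20, 0.95]))
```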

What Works Today

The lossless short-horizon path is hitting diminishing returns. Recent deployable systems already combine several known ingredients: full-context conditioning, transformer-style draft modules, joint pretraining, hidden states shaped for future-token prediction, quantized serving paths, vocabulary compression, and acceptance-aware training. There may still be incremental gains, but the easy architecture wins are mostly gone.

For a model without a built-in drafter, EAGLE3 is the practical default. It is post-hoc, does not require changing the base model, and keeps the original output distribution intact through verification. If you control training, MTP can be part of the base-model fine-tuning path; if not, train EAGLE afterward.

The highest-leverage work now is the infrastructure around speculative decoding. Speculative decoding is less about inventing one more clever head and more about scheduling, adaptation, draft placement, and hiding draft latency under verification.

Local inference is different. When batch size is one, the machine may have enough idle FLOPs to justify longer drafts or DFlash-style systems. In production, those same verification slots compete with real user requests in the batch, so long drafts are usually much harder to justify.

Benchmark On Your Real Distribution

Benchmark on the same distribution the drafter was trained for. Speculative decoding speedup is data-dependent. Random prompts, synthetic garbage, or a different workload mostly measure distribution mismatch. The target model will still be correct, but accepted length and speedup will look worse than they should. Use representative prompts and production sampling settings.
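A simple way to see the effect is to log how many draft tokens each verify step accepted and compare workloads. The sketch below uses made-up counts; `tokens_per_verify_step` is an illustrative helper, and it relies on the fact that every verify step emits the accepted draft tokens plus one token from the target (the bonus token or the resampled one).

```python
from typing import Dict, Sequence

def tokens_per_verify_step(accepted_counts: Sequence[int]) -> float:
    """Mean tokens emitted per target-model step: accepted draft tokens + 1."""
    if not accepted_counts:
        raise ValueError("need at least one verify step")
    return sum(c + 1 for c in accepted_counts) / len(accepted_counts)

# Hypothetical per-step accepted counts logged on two workloads.
logs: Dict[str, Sequence[int]] = {
    "representative_prompts": [3, 4, 2, 4],
    "random_prompts": [0, 1, 0, 0],
}
for name, counts in logs.items():
    print(f"{name}: {tokens_per_verify_step(counts):.2f} tokens per verify step")
```

Tokens per verify step is only an upper bound on wall-clock speedup, since it ignores draft cost and verification overhead, so measure end-to-end latency on the same prompts as well.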

Epilogue

If you made it this far, thank you for reading. This page is a live document, not a one-time post. I will keep updating it as new methods appear and as I get better intuitions about what matters in practice. If you spot a mistake, disagree with a framing, or want me to expand a section, send me a message — you can find my contacts here.