SRT: Speculative Rollout with Tree-Structured Cache

Cornell · UIUC · Tsinghua · UW · ByteDance · Chang & Zhu et al. · NeurIPS 2025 Workshop · 2026-01-14 · arXiv:2601.09083
关键词: RL post-training · speculative decoding · model-free draft · per-prompt token tree · run-ahead generation · GRPO/DAPO/PPO/ReTool

速读卡片 (TL;DR)

一句话:同一个 prompt 在不同训练 epoch、不同 group rollout 之间产出的 response 高度相似 — SRT 把这些历史 token 序列存成每个 prompt 一棵的 token 树,用它充当 model-free draft 给当前 policy 做 speculative decoding,免训练、零损失、即插即用。

2.08×

rollout 端到端最高加速

draft model 参数量 (model-free)

~70%

RL step 中 generation 占比

立场:SRT 不需要训 draft、不动 sampling 分布、不破坏 on-policy。它把"加速"交给一棵 CPU 内存里的 token 树 + 两条把树喂饱的策略 (on-the-fly insert + run-ahead during bubbles)。

1 · 动机

1.1 历史脉络:rollout 怎么从"小头"变成"大头"

2024 年之前的 RLHF:reward model + PPO,response 长度多在 256–512 token,generation 在一个 RL step 里只占 30–40%,大家把功夫花在 critic loss、reference model KL、replay buffer 上。

2025 年起 reasoning 模型起飞 (DeepSeek-R1, GLM-4.5, Kimi-1.5, Seed1.5-Thinking, Qwen3),trajectory 长度普遍 4k–32k token,GRPO/DAPO 还要求每个 prompt 同时采 5–16 个 response。论文 Figure 2a 给出的硬数据:在 PPO/GRPO/DAPO/ReTool 四个算法上,rollout 平均吃掉 65% (有时 >70%) 的 RL step 时间。这就是 Amdahl 视角下"必须被攻击的部分"。

更糟的是 long-tail 问题:

同一个 batch 内 response 长度高度不均衡 — 多数早早 EOS,少数 long-CoT 一路写到 max_response_length (DAPO 里 8k, ReTool 里 16k)。
同步 RL 必须等最后一条结束才能进 train 阶段 → 一堆 GPU 在"bubble"里空转等。

典型 GRPO/DAPO batch 的 rollout 时间分布:大半 GPU 在 bubble 里等长尾。SRT 的 run-ahead 策略就是来填这块 bubble 的(把空闲算力用来"预热"下一批 prompt 的 cache)。

1.2 别人的招数为什么"够呛"

加速手段	动了什么	代价
Async RL (AReaL, PipelineRL)	off-policy: 让 generation 滞后 learner k 步	policy lag → 收敛性可能受损
Selective rollout (GRESO, Act-Only-When-it-Pays)	跳过没信息量的 prompt	改了训练分布
FP8 / 量化 rollout	降精度生成	rollout 与 train 分布 mismatch
EAGLE-3 (NVIDIA NeMo-RL)	训一个轻量 draft head 复用 hidden	lossless · 需独立 offline init + 维护对齐
SuffixDecoding	retrieval-based draft, 把全语料后缀当 draft	model-free · 未利用 RL 跨 epoch / 跨 group 的同一 prompt 高重叠性
RhymeRL (并行工作)	缓存上一 epoch 的完整 response, offline async 更新	first-time prompt cold-start;cache 与本步 in-flight 状态脱节
SRT (本文)	per-prompt token tree + on-the-fly insert + run-ahead	lossless · on-policy · model-free · 无 cold-start

1.3 为什么 tree-structured cache 是"非平凡的解锁"

把"用历史 rollout 当 draft"这件事说出来不难,真正非平凡的是怎么存、怎么查、怎么填。考察以下三个观察:

跨 epoch 相似度高,但不是 verbatim 重复。Figure 2c 实测同一 prompt 在第 t 步和前 t-1 步累积 n-gram 的 overlap 随 epoch 单调上升,后期可达 60%+。意味着子串命中率高,但整段 response 命中率低 — 必须以"路径"为单位检索,不能以"句子"为单位。
GRPO/DAPO 同一 prompt 一次产 K=5–16 个 rollout。这 K 条彼此当下就高度相似 (相同 prompt + 相同 policy + 不同 sample,只差 stochastic decoding)。如果只在 epoch 边界更新 cache,会错过同一 batch 内这 K-1 条对第 K 条的帮助 — 这正是 RhymeRL 的盲区。
Long-tail bubble 是免费算力。Run-ahead generation 不会增加 wall-clock(它本来就 idle),但能给即将出现的 prompt 把 cache 提前烧热,消除 first-time cold-start。

把这三个观察落到数据结构上,Trie / token tree 是几乎唯一合理的选择:

共享前缀压缩存储 (一句话 "the cat sit on the mat" 和 "the cat is on the mat" 共享 "the cat")
沿路径累计频率统计 — 父节点 count = 子节点 count 之和,自底向上聚合
线性时间插入 (按 token 走一遍即可),CPU 内存就够,与 KV cache 无关
查询时:从当前已生成 suffix 的最长匹配出发,沿"概率最高的子节点链"贪心展开 → 一束 draft tokens

"tree-structured" 的真实含义. 注意:这棵树不是 KV cache 的拓扑,也不是 SpecInfer 风格的 draft token tree (那个是验证拓扑)。SRT 的树是per-prompt 的 token-frequency trie,存活在 CPU 主存里,跨训练 step 持续生长。它和 GPU 上的 paged KV cache 完全解耦。

2 · 背景速查

术语	含义
SRT	Speculative Rollout with Tree-structured cache (本文)
Per-prompt tree T_p	SRT 为每个 prompt p 维护的一棵 token-frequency trie,存所有历史 rollout 的 token 子序列
Model-free draft	不需要任何参数化模型,直接从 cache 取候选 token
On-the-fly insert	当前 batch 的 rollout 一边生成一边把 token 插进 T_p
Run-ahead generation	当 batch 大半 rollout 已 EOS,用空闲 GPU 给"未来 batch"的 prompt 提前生成 partial rollout 喂 cache
Verifier (target)	当前 RL policy π_θ,负责验证 draft tokens 是否被接受
SuffixDecoding	Oliaro et al. 2025: 用全局后缀树做 model-free draft,本文主要 baseline 之一
RhymeRL	He et al. 2025: 并行工作,只用上一 epoch 完整 response offline 更新 cache
N-gram drafting	Yang et al. 2023 (LLMA): 把全语料 n-gram 当 draft,本文另一 baseline
DAPO	ByteDance 数学 RL 算法,K=16/prompt,max_response=8k
ReTool	Feng et al. 2025: 多轮 tool-use RL,max_turns=8, max_response=16k
Verl	HybridFlow 的开源 RLHF 框架 (Sheng et al. 2025),本文集成进去

2.1 标准 speculative decoding 复习 (一行)

Draft 给 k 个 token → target 一次并行验证 k+1 个位置 → rejection sampling (Leviathan 2023) 保证输出分布与直接从 target 采样等价。SRT 只是把"draft 模型"换成"从 token tree 沿路径抽 token"。

3 · 核心机制:per-prompt token tree + 路径打分

3.1 数据结构

对每个 prompt p,维护一棵 trie T_p:

每个节点 u 对应一个 token 子序列 (从 root 到 u 的路径)
每个边由一个 next-token 标号
每个节点存 count(u) = 这个子序列在历史 rollout 里出现的次数
非叶节点的 count = 所有子节点 count 之和 (运行时维护即可,O(depth) 更新)

Per-prompt tree T_p 的具体形状。蓝节点 = depth-1, 黄节点 = depth-2, 绿节点 = depth-≥3 (用作 draft 候选)。蓝色 root 是空 prefix。SRT 当前在 prompt p 上生成到 "The answer", 沿树走匹配,从节点 "answer" 出发贪心采 "is"→"2" 作为 draft 提议给 verifier 验证。

3.2 查询过程 (drafting)

当前 partial continuation 是 y₁:t,SRT 这样查 T_p:

最长后缀匹配:从 y_t 往前回溯,找最长的 q 使得 y_{t-q+1:t} 是 T_p 上某条路径。这一步决定 "我们站在树的哪个节点 u_q"
如果连 q=1 都没匹配上 (e.g. 前所未见的 token),退化为正常 1-step 自回归,下一步再试
贪心扩展:从 u_q 出发,反复挑选 conditional 概率最高的子节点 v,即:
C(v) = count(v) / Σ_{w ∈ children(parent(v))} count(w)
路径分 = 沿路 C 的乘积 (类似 beam 但只保留 top-1)
扩展到预算 B(q) 用尽 (depth 限制 / 或路径分阈值) — 论文没给精确公式,但实现里通常是 fixed budget = 几十个 token
把这一束 draft tokens 一次性丢给 verifier (current policy) 做 spec decoding 标准验证

3.3 验证过程 (一次 forward, classical spec decoding)

Verifier (current π_θ) 把 [y₁:t, t̂₁, t̂₂, ..., t̂_K] 全部一次 forward,得到每个位置的 p_target。然后:

对每个 t̂_i,以概率 min(1, p_target(t̂_i | context) / p_draft(t̂_i)) 接受
第一次 reject 时,从 (p_target − p_draft)+ 重采样得到一个新 token,停止

这里有个 subtle 点:p_draft 是什么? SRT 的 draft 来自 cache 频率,不是一个真正的 distribution。论文 (隐含) 用 C(v) 作为 p_draft;这样 rejection sampling 的数学保证依然成立 — 因为 rejection sampling 只要求 p_draft 是某个合法分布,无所谓它有多差 (差只意味着接受率低,不意味着输出分布偏)。这就是"lossless"成立的关键。

反向论证:如果不做 rejection sampling,直接把 cache 里频率最高的 token 当输出 — 那 SRT 就退化成"非分布等价"的近似采样,会改 GRPO 的训练分布,违反 on-policy。所以 "tree-cache 提议 + rejection sampling 验证" 这两件事必须捆绑。

4 · Cache 维护:on-the-fly + run-ahead

树本身只是数据结构;怎么把树喂饱才是 SRT 的真正贡献。这一节讲两条互补的 maintenance 策略,以及它们各自解决什么。

4.1 On-the-fly insert (跨 group 内即时受益)

当前 batch 的 K 条 rollout 在生成的同时就把 token 流式插进 T_p:

GRPO/DAPO 一个 prompt 同时跑 K 条 rollout。先 EOS 的 rollout 1, 2 直接把整段 token 流插进 T_p;同 batch 还在生成的 rollout 3, 4, 5 立刻可以从这棵被刚刚喂饱的树里抽 draft。RhymeRL 错过的就是这个"同 batch 内"的快速反馈通道。

4.2 Run-ahead generation (用 long-tail bubble 预热未来)

另一条:当 batch 内只剩零星 long-tail rollout 还在跑、其它 GPU 已经空,SRT 不让 GPU 闲着,而是从 data loader 的 lookahead 窗口 / active prompt queue 里取下一个待训练的 prompt p',对它跑一段 partial generation:

这些 token 插进 T_{p'}(给未来用),
但不会被当成 learning targets — 真正的 rollout 还是要等到那个 prompt 被排到时,由当时的 policy 重新生成。

Run-ahead 的本质是"把 long-tail bubble 这块免费算力转移到未来 prompt 的 cache 预热"。这些 token 不计入训练梯度,只生只填。代价:可能多用一点 KV cache (但反正 GPU 闲着) + 一些调度复杂度。

论文 §4.2 给出消融:enable run-ahead 后 mean accepted tokens / step 进一步上升,而单独 history-only baseline (RhymeRL 风格) 由于 cold-start 在前几个 epoch 几乎没收益。

5 · Worked example:一个 prompt 跨多 step 的完整生命周期

把上面所有机制串起来,我们追踪一个具体 prompt p = "Q: 求 ∫x² dx" 在三步 RL 训练中 cache 是怎么生长、查询、命中的。

Step t=1 (cold start)

T_p 是空树。GRPO 采 K=5 条 rollout,全部走标准自回归。
5 条 rollout 完成,token 序列例如:
- r1: "Let me compute. ∫x² dx = x³/3 + C."
- r2: "积分 ∫x² dx 等于 x³/3 + C 。"
- r3: "We have ∫x² dx = x³/3 + C."
- r4: "Let me think. The integral is x³/3 + C."
- r5: "积分等于 x³/3 + C."
5 条全部插进 T_p。注意 "x³/3 + C" 这个尾巴在所有 5 条里都出现 → cache 里这条路径 count=5,极强 signal。

Step t=2 (on-the-fly 受益)

同一 prompt p 又被采样 (DAPO 里 K=16,假设这次也是 5)。
第一条 rollout 自回归生成,产到 "积分 ∫x² dx",在 T_p 上找到这个 prefix → 抽 draft "等于 x³/3 + C 。"。
Verifier 一次 forward 验证:假设 "等于 x³/3 + C 。" 全部接受 — 这一步本来要 12 token 自回归,现在 1 次 forward 完成 → α ≈ 12,巨大加速。
这条 rollout 一边生成一边把刚验证的 token 也插进 T_p (count++),后续 rollout 拿到更新过的更胖的树。

Step t=2.5 (run-ahead 预热)

step t=2 batch 里有 1 条 long-tail rollout 在 8k token 上挣扎,其它 GPU 空闲。
调度器从 lookahead window 取下一个 prompt p' = "Q: 求 ∫sin(x) dx",用当前 policy 跑一段 partial gen。
partial 产出 "积分 ∫sin(x) dx 等于 -cos(x) + C" → 插进 T_{p'},不计 loss,不留在 batch。
step t=3 的 batch 里真正轮到 p' 时,T_{p'} 已经是热的。

Step t=3 (持续累积)

policy π_θ 已经被 update 了两步,但跟 t=1 时差距不大 — Figure 2c 实测显示同 prompt 跨 epoch overlap 单调上升,所以 cache 里那些"x³/3 + C" 路径依然适用。
新的 5 条 rollout 与 cache 高度匹配 → α 持续高 → 速度持续优于 baseline。

关键张量/形状细节(展开)

Cache 占用: per-prompt tree 大小 ≈ K × max_response × avg_branching × node_size。Qwen2.5-1.5B + DAPO (K=16, max=8192) 估算 ≈ 几 MB / prompt。一个 epoch 几千个 prompt → 几 GB CPU 主存,完全可控。
Verifier 一次 forward 形状: input_ids = [prompt + y₁:t + draft_tokens]. 假设 draft 长度 K=8, prompt+history 已经 2k token, → batch=1 × seq_len=2008,标准 vLLM paged attention 一次过。
Rejection sampling 复杂度: O(K) 对一束 draft, 不是 GPU 瓶颈。

6 · 实验关键结果

6.1 主表 (Table 1, Qwen2.5-1.5B, on-policy)

Method	Step time (s) ↓				Generation time (s) ↓
	PPO	GRPO	DAPO	ReTool	PPO	GRPO	DAPO	ReTool
Baseline	47.5	42.9	81.7	74.8	31.5	31.8	44.1	49.0
N-gram	47.3	42.1	84.5	69.3	31.4	31.1	46.0	45.0
SuffixDecoding	35.9	30.7	103	69.2	18.4	19.7	62.5	38.8
SRT	31.5	26.2	68.7	58.8	15.2	15.4	31.5	28.7

读法重点:

SuffixDecoding 在 DAPO 上反而变慢了 (103 > 81.7) — 注意这个细节。SuffixDecoding 用全局后缀树作 draft,在 K=16/prompt 的 DAPO 设置下可能因为 draft 命中无关 token 浪费 verify 算力,或者后缀树太大导致 lookup overhead。SRT 因为 per-prompt tree 没这个问题。
N-gram 几乎没用 — 全语料 n-gram 与"这个 prompt 该生成什么"几乎不相关。
SRT generation time 在 ReTool 上 49→28.7 (1.7×) — 多轮 + 长 trajectory 是 SRT 的甜区:cache 累积更厚。

6.2 Ablation: K 越大 SRT 越甜 (Table 3)

GRPO setting	Baseline gen	SRT gen	improvement
n=5 responses/prompt	31.8	15.4	21.8% (gen)
n=10 responses/prompt	47.1	20.4	33.1% (gen)

这是 SRT 论点的核心实验性证据:每个 prompt 同时采的 rollout 越多,on-the-fly insert 提供的早期 hint 越多 → 后产生的 rollout 命中率越高。意味着 DAPO (K=16) / GSPO (K=64) 这种"高分组数"的现代算法上 SRT 收益会进一步放大。

6.3 Ablation: run-ahead 的边际收益 (Figure 5)

论文用仿真比较:

history-only (RhymeRL 风格): mean accepted tokens 上升缓慢,前期 cold-start 严重
+ on-the-fly insert: 立刻拉高
+ run-ahead: 进一步抬升,曲线最早期就站在高位

7 · 与同类工作对比

工作	Draft 来源	Cache 更新	损失/语义	定位
SRT (本文)	per-prompt token tree (model-free)	on-the-fly + run-ahead	lossless, on-policy	RL rollout 专属
RhymeRL (并行工作)	per-prompt 历史 response	仅 epoch 末 offline 更新	lossless	RL rollout, 但有 cold-start
SuffixDecoding (Oliaro 2025)	全局后缀树	静态/语料级	lossless	通用 inference,非 RL
N-gram / LLMA (Yang 2023)	全语料 n-gram retrieval	静态	lossless	通用 inference
SPEC-RL	训一个 EAGLE-style draft	RL 训练时同步更新 draft	lossless	RL rollout, 带参 draft
ReSpec	EAGLE draft + reward-weighted adaptation	online + reward-aware	lossless	RL rollout, 自适应 k/draft
NVIDIA NeMo-RL × EAGLE-3	EAGLE-3 draft head	offline init + 可选 online detach 更新	lossless, verifier-exact	生产 RL,32→2048 GPU
DAS	动态/自适应 draft 配置	—	lossless	RL rollout 调度

SRT 在这条线里的独特位置

它是唯一完全 model-free 的 RL spec decoding。EAGLE-3 / SPEC-RL / ReSpec 都要训一个 draft 模型并维护它的对齐;SRT 一行额外参数都不需要。
它对算法侧零侵入 — 直接接 Verl + vLLM 就能跑,GRPO/DAPO/PPO/ReTool 全适配。NeMo-RL 的方案需要把 EAGLE-3 head 嵌进 vLLM 的 inference path,系统集成成本高。
但它的加速天花板取决于"prompt 重复出现 + 跨 epoch 相似性"。如果数据集每 epoch 只过一遍 (no replay) 或 prompt 极多样,cache 就喂不饱。EAGLE 系不依赖这个假设。

互补,不互斥. 理论上可以把 SRT 的 tree-cache 作为 EAGLE-3 draft head 之上的 retrieval-augmentation,先查 cache 命中长 prefix,失败再 fallback 到 EAGLE-3 head。论文没做这个组合,但工程上是显然的下一步。

8 · 局限 / 个人 take / 待验证问题

论文承认的局限 / 我观察到的

实测只跑了 Qwen2.5-1.5B,8× Hopper。在更大模型 (8B+) 上 verify 的 forward 本身就更贵,SRT 的相对加速可能稀释。
Cache 大小没量化分析 — 大数据集 + 长 prompt + 多 epoch,CPU 主存占用可能爆。论文没给上限管理策略 (LRU? 频率阈值剪枝?)。
对 K (responses/prompt) 高度依赖。PPO (K=1) 上仍有效但收益最小;GRPO/DAPO 是甜区。
Run-ahead 的实现细节 (调度器怎么决定何时启动、怎么避免占用即将到来的 batch 的 KV cache) 论文几乎没写。
没和 EAGLE 系做正面 head-to-head 比较 (NeMo-RL/ReSpec/SPEC-RL),只比了 model-free baselines (N-gram, SuffixDecoding)。读者很难知道 model-free 路线相对带参 draft 的真实差距。

我的疑问 (落地后想验证)

Cache 漂移问题:RL 训到几百 step 后 policy 已经显著演变,早期插入的 token 可能"过期"。SRT 用 frequency 但没有 recency decay。在 long horizon RL (1k+ step) 上 acceptance 是否会衰退?
探索 vs 利用矛盾:如果 cache 里某条路径出现频率极高,SRT 会不断提议它 — 而 rejection sampling 虽然分布等价,但方差不一样;高方差 prompt 上 effective sample 数会变少,对 GRPO 的 advantage 估计可能引入偏置 (虽不破坏期望)。
多轮 (ReTool) 场景下,树是 per-prompt 还是 per-(prompt, turn-history)?论文没说清楚。如果 per-prompt 但 turn 之间状态差异大,cache 会很乱。
与 EAGLE-3 / ReSpec 组合:cache 命中率高时直接取,低时 fallback 到 EAGLE 模型。系统上是否值得这复杂度?ablation 缺位。
Run-ahead 的"未来 prompt"可能已经不是 next batch 的 prompt:大规模分布式训练里,sampler 从全局打乱,run-ahead 怎么准确预测?
KV cache 与 paged-attention 的交互:verify 一束 draft tokens 时,K+1 个位置一起进 attention,在 vLLM 的 page table 里要预留连续 page。SRT 在 long-context (32k+) 下的 OOM 风险?

记忆点

立场 RL rollout 的 redundancy 是金矿;tree-cache + on-the-fly + run-ahead 是不需要训 draft 的 lossless 加速
数据结构 per-prompt token-frequency trie,CPU 内存,跨 step 持续生长
两条 maintenance on-the-fly (跨 K rollout 即时受益) + run-ahead (用 long-tail bubble 预热未来)
公式 draft 路径分 = Π C(v) = Π count(v)/Σ count(siblings)
实测 Qwen2.5-1.5B, GRPO/DAPO/PPO/ReTool, gen 时间 1.5–2.08× 加速
甜区 K 大 (DAPO K=16) + prompt 多次重复 + 长 response
死穴 prompt 极多样 / no replay / 短 response → cache 喂不饱

精读笔记 v1 · 2026-05-07 · 配套论文 PDF 在 /data/szhang967/papers/spec-rl/SRT_2601.09083.pdf