HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Meituan LongCat Team · Jianing Wang et al. · 2026-05-04 · arXiv:2605.02396
Keywords: heavy thinking · parallel reasoning · sequential deliberation · agentic harness · test-time scaling · RLVR

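Quick-Read Card (TL;DR)

In one sentence: take the system-level Agent Swarm workflow ("orchestrator + K subagents + summarizer") and abstract it into an intrinsic skill of the LLM itself. The skill has two stages, parallel reasoning followed by sequential deliberation, and is packaged as a portable Markdown skill file that the model executes autonomously inside harnesses such as Claude Code; its width and depth can be trained further with RLVR.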

2 stages: parallel + sequential deliberation

K=8/16: default number of trajectories

HM ≥ V@K: beats majority vote almost across the board

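Position: the paper does not invent a new algorithm. Its contribution is to recognize that the essence of Agent Swarm is just two steps, "parallel thinking + summary", which can collapse from the external framework into the model itself, and to show with large-scale evidence (11 models × 8 benchmarks) that this internalization path works and can be trained with RL.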

1 · Motivation: Why Internalize the Swarm as a Skill

1.1 Historical Context: From Long CoT to Agent Swarm to Intrinsic Skill

The reasoning story of 2024–2025 can be split into three parallel threads:

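Single-chain long CoT: o1 / R1 / MiniMax-M1 pushed the thinking budget from a few hundred tokens to 30k–80k tokens, which is test-time scaling in the depth direction. But a single chain has high variance: if that one chain goes wrong, the answer is wrong.
Best-of-N / Majority Vote: the "Large Language Monkeys" line of Brown et al. samples N independent trajectories and then selects by voting or with a verifier. It is cheap and parallel, but it does not synthesize across trajectories: if the correct answer appears in only 1 of 16 trajectories, voting discards it.
Agent Swarm / Orchestrated Harness: the approach of Claude Code, Hermes, OpenClaw, and Kimi K2.5. An orchestrator model spawns K subagents that each solve a subproblem, call tools, and write memory, and the orchestrator aggregates at the end. This implements width × depth scaling at the system layer.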

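HeavySkill's core thesis: Agent Swarm looks elaborate (skills, memory, tools, subagent scheduling protocols), but the mechanism that actually drives performance is just "K independent lines of thinking + one critical summary". If so, why should an external orchestration framework carry this? Package it as a skill file and the LLM can execute it through in-context learning. That is "Heavy Thinking as Inner Skill".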

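Fig 1. Left: in Agent Swarm an external orchestrator schedules multiple subagents plus a summarizer, and the framework holds the scheduling state. Right: HeavySkill compresses this flow into a readable skill file that is loaded into the model's context, and the model itself spawns the parallel thinkers and the deliberator in context.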

1.2 Why Other Approaches Fall Short

Put every current route for trading more compute for quality into one table:

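Route	Representative	width	depth	Synthesis mechanism	Failure mode
Long CoT, single chain	o1, R1, MiniMax-M1	1	large (30k+)	none	one error carries through to the end; high variance
Best-of-N + Verifier	PRM, MCTS	N	medium	external verifier scoring	requires training a verifier; misses low-frequency correct answers
Majority Voting	Self-Consistency	N	medium	mode (plurality) vote	drops the correct answer when it is not the most frequent one (typically when its frequency is below 50%)
Tree of Thoughts / MCTS	ToT, rStar	tree	medium	external heuristic + verifier	branching heuristics are hard to tune; high latency
Native parallel decoding	ParaThinker, Multiverse, Hogwild!	architecture change	medium	token-level synchronization	requires changes to attention/positional encoding; not portable across models
Agent Swarm	Kimi K2.5, Claude Code	K subagents	long CoT each	orchestrator LLM summarize	coupled to the system, framework-specific; orchestration overhead
HeavySkill	this paper	K=8/16	long CoT each	deliberation intrinsic to π_φ	K trajectories occupy context; needs a long-context model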

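Key observation: HeavySkill and Agent Swarm differ not in mechanism but in where the mechanism lives. The former keeps it in the model weights plus a skill prompt, the latter in external framework code. Best-of-N is what HeavySkill degenerates to when the second stage is removed; ToT is an engineering variant of HeavySkill with a tree-shaped first stage.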

1.3 Why This Is Non-Trivial

Stuffing K independent trajectories into a single context looks like trivial prompt engineering, but three things about it are non-trivial:

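The second stage must be critique, not majority: the paper repeatedly stresses that "deliberation is synthesis, not voting". On R1-Distill-Qwen-7B, of the 1400 queries with parallel pass-rate < 0.5, 500+ are rescued by deliberation, whereas majority voting loses these cases because the correct answer is in the minority.
HP@K can exceed P@K: this is the paper's strongest empirical finding. The summarizer can synthesize a correct answer that none of the K original trajectories contains, which means the deliberator is not just a selector but also a re-reasoner. This breaks through the P@K ceiling of the BoN view.
RLVR on heavy mode is a new optimization target: traditional RLVR optimizes the outcome reward of a single trajectory, whereas HeavySkill treats the whole "K thinkers + 1 summary" as one trajectory for GRPO/GSPO, which amounts to training width (parallel diversity) and depth (deliberation ability) at the same time. This opens a new design space for training objectives.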

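Note: "Inner Skill" in the title does not mean architectural internalization in the sense of "burning parallel reasoning into the model weights" (that is what ParaThinker / Multiverse do). HeavySkill's "internalization" means autonomous execution in context: the same LLM plays both thinker and deliberator, driven by a skill document rather than by external Python scheduling.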

2 · Background Cheat Sheet (Terms / Metrics)

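Abbreviation	Meaning
K	number of trajectories in parallel reasoning (default K=8 or 16 in the paper)
K⁽¹⁾	number of summaries generated in the sequential deliberation stage, default 4
π_θ	thinker model (parallel stage)
π_φ	deliberator model (summary stage); π_θ = π_φ by default in the experiments
x_c	serialized memory cache: the K trajectories concatenated into one prompt
M@K (Mean@K)	mean accuracy over the K trajectories (baseline)
P@K (Pass@K)	probability that at least 1 of the K trajectories is correct, i.e., the upper bound on the model's latent ability
V@K (Vote@K)	majority-vote accuracy, equivalent to self-consistency
HM@4 (Heavy-Mean@4)	mean accuracy over 4 summarizer runs
HP@4 (Heavy-Pass@4)	probability that at least 1 of the 4 summarizer runs is correct
RLVR	Reinforcement Learning from Verifiable Reward
GSPO	Group Sequence Policy Optimization, the Qwen team's refinement of GRPO (Zheng 2025a)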

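Performance hierarchy (empirically observed on the paper's STEM tasks): P@K ≥ HP@K ≥ HM@K ≥ V@K ≥ M@K. HM beating V is evidence that deliberation is stronger than voting; HP occasionally beating P is evidence that deliberation is stronger than selecting from the pool of trajectories.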
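To make the metric definitions concrete, here is a minimal per-query sketch of M@K, P@K, and V@K. The function names and the exact-string answer comparison are assumptions of this sketch, not the paper's evaluation code; HM@4 and HP@4 would be the same mean/pass computations applied to the K⁽¹⁾ summarizer answers instead of the thinker answers.

```python
from collections import Counter
from typing import List


def mean_at_k(answers: List[str], gold: str) -> float:
    """M@K for one query: fraction of the K answers equal to the gold answer."""
    return sum(a == gold for a in answers) / len(answers)


def pass_at_k(answers: List[str], gold: str) -> float:
    """P@K for one query: 1.0 if at least one of the K answers is correct."""
    return float(any(a == gold for a in answers))


def vote_at_k(answers: List[str], gold: str) -> float:
    """V@K for one query: 1.0 if the most frequent answer is correct.

    Ties go to the answer seen first (an assumption of this sketch).
    """
    top_answer, _count = Counter(answers).most_common(1)[0]
    return float(top_answer == gold)


# A query where the correct answer is in the minority:
minority = ["151", "151", "149"]
print(pass_at_k(minority, "149"), vote_at_k(minority, "149"))  # 1.0 0.0
```

The minority case is exactly where P@K and V@K separate: the correct answer exists in the pool, but voting cannot reach it.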

3 · Method · Two-Stage Pipeline

3.1 Formalization

T_{π_θ}(q, K) = { y₁, y₂, ..., y_K }   // Stage 1: parallel reasoning
x_c = C(T_{π_θ}(q, K))   // serialized cache
T_{π_φ}(x_c, K⁽¹⁾) → final answer   // Stage 2: deliberation

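Each y_i is a complete reasoning chain (internal thinking + final answer) sampled independently by π_θ at temperature=1.0 / topp=0.95 / topk=10. Independence in Stage 1 is the core constraint: thinkers never see each other's output. Stage 2 puts all K trajectories into one context and has π_φ critique them.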
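The three formal steps map onto a very small driver function. The sketch below assumes the thinker, serializer, and deliberator are supplied as plain callables (hypothetical stand-ins for model calls and for the cache function C); it illustrates the control flow only and is not the paper's implementation.

```python
from typing import Callable, List


def heavy_think(
    question: str,
    thinker: Callable[[str], str],               # stands in for sampling from pi_theta
    serialize: Callable[[str, List[str]], str],  # stands in for the cache function C
    deliberator: Callable[[str], str],           # stands in for sampling from pi_phi
    k: int = 16,
    k1: int = 4,
) -> List[str]:
    """Return k1 deliberation outputs for one question."""
    # Stage 1: k independent samples; no thinker sees another thinker's output.
    trajectories = [thinker(question) for _ in range(k)]
    # Serialize all k trajectories into a single cache prompt x_c.
    x_c = serialize(question, trajectories)
    # Stage 2: k1 independent deliberation samples over the same cache.
    return [deliberator(x_c) for _ in range(k1)]


# Usage with trivial stand-ins:
outputs = heavy_think(
    "What is 1 + 1?",
    thinker=lambda q: "2",
    serialize=lambda q, ys: q + "\n" + "\n".join(ys),
    deliberator=lambda x_c: "final: " + x_c.splitlines()[-1],
    k=3,
    k1=2,
)
print(outputs)  # ['final: 2', 'final: 2']
```

Passing the three roles as callables keeps the default setting π_θ = π_φ (same function for both) and the cross-model setting (different functions) in one code path.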

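Fig 2. The two-stage pipeline. In stage 1, π_θ runs K independent samples to produce K trajectories (some of which, such as y₃, may be wrong); all of them go into the serialized memory cache, and in stage 2 π_φ performs critique-style synthesis rather than a simple majority vote.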

3.2 What the Stage 2 Prompt Looks Like

Figure 7 of the paper gives the full prompt (Appendix C). It has four core constraints, worth remembering because they are what separates HeavySkill from self-consistency:

"It is generally believed that when most thinkers get the same answer, the answer may be correct. But you can't do it so superficially, because the correct answer may come from very few thinkers..."
"If you realize that none of these thinkers have answered correctly, you can even learn from the wrong experiences ... and re-think the given problem to give the answer you think is most correct."
"Please DO NOT just solve the given problem independently ... but summarize the thought process of all thinkers."
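Format consistency: math problems keep \boxed{...}; code problems keep the triple-backtick code fence.

This prompt pushes π_φ into a dual role as verifier and fallback solver: it acts as a critic (judging whether each thinker is right) and as a backup reasoner (solving the problem itself when all thinkers are wrong).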

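4 · Method · Serialized Memory Cache

4.1 Design Motivation

With K=16 reasoning trajectories averaging ~10k tokens each, the total is about 160k tokens, more than most models' context windows can hold. So the cache is not a plain concat; it does two things: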

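Pruning: cut the redundant reflection and retry passages from each trajectory's internal thinking, keeping only the reasoning skeleton + final answer.
Shuffling: randomly reorder the K trajectories so that the deliberator does not develop a position bias, for example always trusting thinker #1.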

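Fig 3. The two processing steps of the Serialized Memory Cache. Pruning keeps the reasoning skeleton and drops redundant internal thinking so that the total length fits in the deliberator's context; shuffling erases the original order so that the deliberator does not learn to trust the first few thinkers.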

4.2 Worked Example: The Tensor View

Take R1-Distill-Qwen-7B on one AIME25 problem with K=16:

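Raw 16 trajectories: avg 12k tok each → 192k tokens, far beyond the 32k context window.
Pruned skeletons: avg 1.8k tok each → ~28.8k tokens, leaving roughly 3k of the 32k window for the deliberator to write its summary.
Shuffle: draw a random permutation σ ∈ S₁₆ and concatenate in the order given by σ.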
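The token budget in this example is simple arithmetic, sketched below. The helper name is hypothetical, and "32k" is read as 32,000 tokens (an assumption; a 32,768-token window would leave slightly more room). The inputs are the rounded averages from the example, so the result is an estimate.

```python
def cache_budget(k: int, avg_tokens: int, ctx_window: int) -> int:
    """Tokens left in the window after concatenating k trajectories.

    A negative value means the cache does not fit at all.
    """
    return ctx_window - k * avg_tokens


CTX = 32_000  # assumption: "32k" read as 32,000 tokens

print(cache_budget(16, 12_000, CTX))  # -160000: raw trajectories overflow the window
print(cache_budget(16, 1_800, CTX))   # 3200: pruned skeletons fit, with a small margin
```

With these numbers the margin left for the summary is only a few thousand tokens, which shows how tightly K=16 sits inside a 32k window even after pruning.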

Final prompt structure:

You are a great reasoner...
====== Problem ======
Find n+t where ...
====== Thinkers Process Start ======
# Thinker #1 ← y_{σ(1)}: [skeleton]
# Thinker #2 ← y_{σ(2)}: [skeleton]
... (×16)
====== Thinkers Process End ======
Summarize from these ... give your answer.

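Counterfactual: without pruning, the prompt overflows the context and the deliberator cannot see all K trajectories; without shuffling, when the generation order matches the answer-frequency order, the deliberator is observed to follow the first trajectory blindly (the Max-Length / Max-Answer-Num experiments in the paper's Appendix A indirectly confirm that position- and size-related biases are real).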
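The shuffle-then-concatenate step and the prompt layout shown above can be sketched as follows. The function name and the seeded `random.Random` are assumptions of this sketch; the header and footer strings are copied from the structure above with their elisions ("...") left as-is, since the full wording lives in the paper's Figure 7. Pruning is assumed to have happened already.

```python
import random
from typing import List


def build_deliberation_prompt(problem: str, skeletons: List[str], seed: int) -> str:
    """Shuffle the pruned skeletons and lay them out in the deliberation prompt."""
    rng = random.Random(seed)
    order = list(range(len(skeletons)))
    rng.shuffle(order)  # random permutation sigma over the K thinkers

    lines = [
        "You are a great reasoner...",
        "====== Problem ======",
        problem,
        "====== Thinkers Process Start ======",
    ]
    # Slot numbers follow the shuffled order, so "Thinker #1" carries no
    # information about which trajectory was generated first.
    for slot, idx in enumerate(order, start=1):
        lines.append(f"# Thinker #{slot}: {skeletons[idx]}")
    lines += [
        "====== Thinkers Process End ======",
        "Summarize from these ... give your answer.",
    ]
    return "\n".join(lines)


print(build_deliberation_prompt("Find n+t where ...", ["skeleton A", "skeleton B", "skeleton C"], seed=0))
```

Relabeling the slots after shuffling is a design choice of this sketch: the deliberator only ever sees positions 1..K in the permuted order, never the original generation index.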

5 · Method · Iterative Deliberation

The second stage can also be iterated. In round t ∈ {2..N}, the previous round's deliberation output is concatenated back into the cache:

x_c^(t) = T_{π_φ}(x_c^(t-1), K^(t-1)) || x_c^(t-1)

This is equivalent to letting the summary join the next round as a new thinker. The intuition is that humans repeatedly revisit old ideas on hard problems.
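The update rule can be sketched as a short loop. As a simplification, this sketch produces one summary per round (the formula allows K^(t-1) of them), and `deliberate` is a hypothetical stand-in for a π_φ call.

```python
from typing import Callable, List


def iterative_deliberation(
    cache: str,
    deliberate: Callable[[str], str],  # stands in for one pi_phi generation
    n_rounds: int,
) -> List[str]:
    """Run n_rounds of deliberation, feeding each summary back into the cache."""
    summaries = []
    for _ in range(n_rounds):
        summary = deliberate(cache)
        summaries.append(summary)
        # x_c^(t) = summary || x_c^(t-1): the summary joins as a new thinker.
        cache = summary + "\n" + cache
    return summaries


# Stand-in deliberator that just reports how many lines it was shown:
rounds = iterative_deliberation("y1\ny2", lambda c: f"summary over {len(c.splitlines())} lines", 3)
print(rounds)
# ['summary over 2 lines', 'summary over 3 lines', 'summary over 4 lines']
```

The cache grows by one entry per round, so iteration also consumes context budget on top of the K original trajectories.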

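Fig 4. Iterative deliberation: each round's summary output is fed back into the cache and enters the next round of deliberation as a new "expert thinker". Figure 4 of the paper shows HM@K rising monotonically with N while HP@K falls, a typical exploration vs. exploitation trade-off.

5.1 Empirical Observations (paper §4.3)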

iteration N	HM@K (synthesis quality)	HP@K (potential)
1	baseline	baseline
2	↑	↓ slight
3	↑↑	↓
4	↑↑↑	↓↓

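Interpretation: iteration makes the deliberator converge more and more onto a consensus (HM rises), but it also accumulates the biases of past summaries in the context, so the model's ability to explore new solutions drops (the HP ceiling is pushed down). This is one of the few negative findings in the paper and is worth remembering.

6 · Method · From Workflow to Readable Skill

6.1 The Fundamental Difference Between Workflow and Skill

This is the core of the "in Agentic Harness" part of the title. The same two-stage pipeline can be implemented in two ways: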

	Workflow Mode	Skill Mode (HeavySkill)
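Who schedules	external Python pipeline	the LLM reads the skill document and executes autonomously
spawn parallel	Python `asyncio.gather` over multiple API calls	the orchestrator LLM calls the Agent tool inside the harness to spawn subagents
memory cache	Python string concatenation	the orchestrator itself concatenates the K outputs into context
deliberation	a second API call, with the prompt inserted by code	done by the orchestrator in its next generation step
Harness dependency	requires writing code	only needs the framework to support skill loading + subagent spawn
portability	rewritten for every framework	one Markdown file works across Claude Code / Hermes / OpenClaw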
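For contrast, the Workflow Mode column can be sketched in a few lines of `asyncio`. The `call_model` parameter and the `fake_model` coroutine are hypothetical stand-ins for a real API client, and the prompt strings are placeholders rather than the paper's wording; the point is only that the scheduling, concatenation, and second call all live in external code.

```python
import asyncio
from typing import Awaitable, Callable, List


async def workflow_mode(
    question: str,
    call_model: Callable[[str], Awaitable[str]],  # hypothetical async API client
    k: int = 8,
) -> str:
    # Stage 1: k concurrent, independent calls scheduled by external code.
    trajectories: List[str] = await asyncio.gather(
        *(call_model(f"Solve step by step, independently:\n{question}") for _ in range(k))
    )
    # Memory cache: plain string concatenation done in Python.
    cache = "\n".join(f"# Thinker #{i + 1}: {t}" for i, t in enumerate(trajectories))
    # Stage 2: a second call whose prompt is assembled by code.
    return await call_model(f"{question}\n{cache}\nSummarize the thinkers and answer.")


async def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned string."""
    await asyncio.sleep(0)
    return f"reply to {len(prompt.splitlines())} prompt lines"


if __name__ == "__main__":
    print(asyncio.run(workflow_mode("What is 1 + 1?", fake_model, k=3)))
```

In Skill Mode none of this code exists: the same three steps are described in the skill document and carried out by the orchestrator LLM through the harness's subagent tool.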

6.2 The Four Components of HeavySkill.md

Appendix C of the paper (Figures 8–10) gives the full skill file, which has four blocks:

Activation Conditions: when to activate (complex reasoning, math, competitive programming, or an uncertain initial approach) and when not to (small talk, simple facts, obvious code edits). This section is cost gating: it avoids burning K× compute on trivial questions.
Parallel Reasoning Protocol: instructs the orchestrator to spawn K=3~5 subagents with the Agent tool (in harness mode; K=8+ in workflow mode). The prompt template each subagent receives is "Solve the following problem step by step ... independently".
Deliberation Prompt: the carefully designed prompt from Figure 7 (critique-not-vote / fallback-solver-when-all-wrong).
Output Constraints: final answer only, no meta-analysis, format matches the domain (\boxed / code block).

Fig 5. The four-component structure of HeavySkill.md. Note that it is plain Markdown with no framework-specific code; the paper reports that it works as-is in both Claude Code and a custom harness.

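7 · Worked Example: One AIME Problem End to End

Figure 1 of the paper uses AIME25 problem 12 as its example. We use it as a complete worked example and follow a run with K=3 thinkers.

Problem: on the interval 0 < x < 2π there are n values of x satisfying f(x) = sin(7π · sin(5x)) = 0; at t of these the graph is tangent to the x-axis. Find n + t.

Stage 1: π_θ independently samples K=3 trajectories (temperature=1.0)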

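Thinker	Strategy	Key steps	final
#1	algebraic	sin(7π·sin(5x))=0 ⇔ 7π·sin(5x)=kπ ⇔ sin(5x)=k/7, k∈[-7,7]	\boxed{149}
#2	case enumeration	for each k, count the solutions with 5x∈(0,10π); k=±7 gives one root per period (the tangent points), the other k give two	\boxed{149}
#3	coarse numerical scan	sample 1000 points and count zeros; misses the boundary and counts t as 12	\boxed{151}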

Stage 2: π_φ deliberation (same model, a fresh generation)

The cache π_φ receives is in the shuffled order thinker #2, #3, #1. Its output (excerpt; the full text is at the bottom right of the paper's Figure 1):

"Most thinkers follow a consistent and correct approach: they begin by determining when sin(7π·sin(5x))=0, leading to sin(5x)=k/7. For the answer inconsistent with the high frequency, [thinker #3 had ans=151], the discrepancy stems from boundary handling at k=±7 ... The answer is well-supported by careful counting and analysis. \boxed{149}"

Key observations:

π_φ does not simply vote (even though 2/3 say 149); it explicitly diagnoses the cause of thinker #3's error (the boundary).
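If every thinker had been wrong (the P@K=0 case), or had reached 149 only by luck through flawed reasoning, the second constraint of the paper's prompt would have π_φ re-derive the solution itself. This is the mechanistic root of HP > P.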
The whole process uses no external verifier, no MCTS, and no PRM; it is just one more round of reasoning in context.
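The count behind \boxed{149} can be checked independently by enumerating the zeros. The sketch below follows the algebraic route of thinker #1: with u = 5x ∈ (0, 10π), solve sin(u) = k/7 for each integer k in [-7, 7], then mark a zero as a tangent point when cos(u) = 0 (since f'(x) = 35π · cos(7π·sin 5x) · cos 5x and the first cosine is ±1 at every zero). The function name and tolerances are choices of this sketch.

```python
import math


def count_n_and_t():
    """Count zeros (n) and tangent zeros (t) of sin(7*pi*sin(5x)) on 0 < x < 2*pi."""
    eps = 1e-9
    upper = 10 * math.pi  # u = 5x ranges over the open interval (0, 10*pi)
    zeros = set()
    for k in range(-7, 8):
        base = math.asin(k / 7)
        for m in range(-1, 7):
            # The two solution branches of sin(u) = k/7, shifted by full periods.
            for u in (base + 2 * math.pi * m, math.pi - base + 2 * math.pi * m):
                if eps < u < upper - eps:
                    zeros.add(round(u, 9))  # rounding merges the coinciding branches at k = ±7
    n = len(zeros)
    # Tangency needs f'(x) = 0, i.e. cos(u) = 0, which only happens for k = ±7.
    t = sum(1 for u in zeros if abs(math.cos(u)) < 1e-6)
    return n, t


n, t = count_n_and_t()
print(n, t, n + t)  # 139 10 149
```

The breakdown is 9 zeros for k=0, 10 each for k=±1..±6, and 5 each for k=±7, so n=139 and t=10. Thinker #3's 151 corresponds to counting t as 12 instead of 10.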

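Counterfactual: what if stage 2 were replaced by majority vote?

With the replacement, 2/3 = 149 wins the vote and this problem's answer is still correct. But in the experiment of the paper's §4.1 (K=16), for queries with parallel pass-rate ∈ [0, 0.5] the majority of thinkers is wrong, so majority voting typically returns a wrong answer. HeavySkill corrects ≥500 of 1400 such queries, and that is the real gain from deliberation.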

8 · Key Experimental Results

8.1 STEM (AIME25 / BeyondAIME / HMMT25 / GPQA-Diamond)

Table 1 of the paper is the core evidence. A few representative rows (K=16):

Model	Bench	M@K	V@K	HM@4	P@K	HP@4
R1-Distill-Qwen-7B	AIME25	41.7	60.0	56.7	66.7	60.0
R1-Distill-Qwen-32B	BeyondAIME	31.4	46.0	44.3	59.0	49.0
Qwen3-32B	GPQA-D	69.0	69.7	70.3	88.4	76.3
DeepSeek R1-0528	AIME25	87.3	90.0	96.7	96.7	96.7
Kimi K2 Thinking	AIME25	95.2	96.7	99.2	100	100
GLM 4.6	HMMT25-Feb	90.4	96.7	99.2	100	100
DeepSeek V3.2 Thk	AIME25	93.5	96.7	100	100	100

How to read the table:

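HM@4 ≥ V@K in almost all rows of the paper's table, i.e., deliberation > majority vote. A few rows tie because of a ceiling effect (e.g., R1-0528 is already at ~90 on AIME25). In the excerpt above, the two R1-Distill rows are exceptions (56.7 vs 60.0 and 44.3 vs 46.0).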
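On frontier models (K2 Thinking, V3.2, GLM 4.6), HM@4 approaches P@K, meaning deliberation extracts almost all of the potential in the K trajectories.
HP@4 occasionally exceeds P@K (e.g., GLM 4.6 on IMO Bench: 86.0 vs P@K=75.1), i.e., the deliberator synthesizes a correct solution that none of the original K trajectories produced.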

8.2 Tool-Interleave (Table 3)

Plug a Python interpreter into the thinker (Qwen3-8B/32B/GPT-OSS-20B), with at most 50 tool rounds:

Model	Bench	M@K	V@4	HM@4
Qwen3-8B	AIME25	55.7	68.3	76.7
GPT-OSS-20B	AIME25	69.8	83.3	90.0
GPT-OSS-20B	HMMT25	55.3	73.3	85.7

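Takeaway: once tool-augmented trajectories enter the cache, the deliberator can still use the execution feedback to support its critique. HeavySkill does not depend on whether trajectories contain tool calls; the two are orthogonal.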

8.3 Cross-model deliberator (§4.2)

Fix π_θ = R1-Distill-Qwen-7B and vary π_φ:

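π_φ	AIME25 K=16 HM	Notes
R1-Distill-Qwen-7B	43.5 → 70.0 (P@16)	same model, self-loop
R1-Distill-Qwen3-8B	↑↑	upgrade within the same family
Qwen2.5-32B-Instruct (12.8% on AIME25 on its own!)	≥ baseline	counterintuitive: π_φ cannot solve these problems itself, yet deliberation still improves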

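From this the paper concludes that the deliberator needs instruction following + synthesis ability rather than raw reasoning strength. It is like treating the K lines of thinking as RAG documents and synthesizing a report from them, which large instruct models are good at.

Fig 6. Intuitive picture of the compute–accuracy trade-off. BoN/Vote sit in the middle; HeavySkill lifts the curve toward the upper left, approaches the P@K ceiling, and occasionally breaks through it along the HP dimension. The cost is K+K⁽¹⁾ times the token consumption.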

9 · RLVR on Heavy Thinking (Appendix B)

This is the part of the paper with the most room to grow. Treat the whole (K thinkers + 1 summary) as one trajectory and run RLVR with GSPO on VeRL:

backbone: R1-Distill-Qwen-7B
data: hard problems with parallel pass-rate ∈ [0, 0.625] (those with room to improve)
K ∈ {8, 16}, K⁽¹⁾ = 4
algo: GSPO, outcome reward = whether the final boxed answer is correct
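Two pieces of this setup are easy to make concrete: the data filter on parallel pass-rate and the outcome reward on the final boxed answer. The sketch below is an illustration under stated assumptions, not the paper's training code: the function names are hypothetical, answers are compared as exact strings, and the regex handles only \boxed{...} without nested braces.

```python
import re
from typing import Dict, List, Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text (no nested braces)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def outcome_reward(summary: str, gold: str) -> float:
    """1.0 if the final boxed answer of the deliberation output matches gold."""
    return 1.0 if extract_boxed(summary) == gold else 0.0


def select_hard_queries(pass_rates: Dict[str, float], max_rate: float = 0.625) -> List[str]:
    """Keep queries whose parallel pass-rate lies in [0, max_rate]."""
    return [q for q, rate in pass_rates.items() if 0.0 <= rate <= max_rate]


print(outcome_reward(r"... careful counting gives \boxed{149}", "149"))        # 1.0
print(select_hard_queries({"q1": 0.0, "q2": 0.625, "q3": 0.75, "q4": 1.0}))  # ['q1', 'q2']
```

Because the reward is attached only to the summary's final answer, the gradient signal for the K thinker segments comes entirely through their effect on that one outcome.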

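Results (Figure 6):

In the first 100 steps, training score and test score rise together, and HM@4 gains another ~10%.
Training is stable at K=8; at K=16 entropy collapses after 100 steps. The paper attributes this to the 7B model's max sequence length: the K=16 cache overflows the context, and truncation produces a wrong gradient signal.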

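Personal take: this is where HeavySkill really diverges from earlier parallel reasoning work. The BoN line of Brown et al. cannot be trained with RL (there is no differentiable "voting" process); the architecture-changing line of Multiverse / ParaThinker requires pretraining again. HeavySkill shapes parallel + deliberation entirely into a sequential context, so it is naturally compatible with GRPO/GSPO, and the policy gradient optimizes width (thinker diversity) and depth (deliberation quality) together. This opens a door for follow-up work: with a long-context model and long-sequence RL infrastructure, one could push K to 32+ and train the deliberator with RLVR at the same time.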

10 · Comparison with Related Work

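Work	Source of parallelism	Synthesis mechanism	Location	Difficulty of RL training
HeavySkill (this paper)	repeated sampling from the same LLM	in-context deliberation by the same LLM	skill document in context	★ (the sequentialized trajectory goes straight into GSPO)
Kimi K2.5 Agent Swarm	orchestrator spawn subagents	orchestrator LLM summarize	external framework + trained orchestration behavior	★★ (multi-turn; needs multi-agent RL infra)
MiniMax-M1	none (single-chain long CoT)	none	reasoning ability in the weights	★★ (already classic RL reasoning)
INTELLECT-3	agent multi-step tool use	step-wise PRM-like	weight + harness	★★★ (end-to-end agentic RL)
ParaThinker / Multiverse	architecture change: multi-branch attention	token-level merge	weight	★★★ (architectural constraints)
Group Think (Hsu 2025)	synchronized thinking tag	token-level sharing	prompt + weight	★★
Self-Consistency / BoN	repeated sampling	majority vote	outside decoding	not RL-trainable (voting is not differentiable)
ToT / MCTS	explicit search tree	external verifier	external	★★ (via PRM)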

10.1 Key Comparison: HeavySkill vs Kimi K2.5 Agent Swarm

These two lines are the easiest to confuse, but the differences are clear:

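Dimension	Kimi K2.5 Swarm	HeavySkill
Design goal	long-horizon agentic tasks (coding, search, multi-turn tool use)	single-turn heavy reasoning (math, GPQA)
subagent roles	heterogeneous: planner / coder / critic each have their own job	homogeneous: K thinkers solve the same problem
State sharing	memory + scratchpad across agents	independent; they meet only at the cache stage
Orchestrator	separate model (or same with role)	no separate orchestrator; the deliberator is the summarizer
Degree of "internalization"	orchestration in the framework	orchestration in the skill prompt
Where it wins	complex engineering that needs division of labor	reasoning with verifiable correctness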

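Conclusion: the two are not substitutes but parallel thinking at different granularities. Kimi's Swarm suits task-level division of labor; HeavySkill suits reasoning-level redundancy + synthesis. In a complete agent system, HeavySkill can serve as the "way of thinking" inside a subagent that a Kimi-style orchestrator calls.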

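10.2 vs MiniMax-M1 Single Chain

MiniMax-M1 pushes single-chain long CoT to the extreme (80k thinking budget); HeavySkill spreads the budget horizontally into K shorter chains + one deliberation. In principle the total token counts are comparable, but HeavySkill has a looser P@K ceiling: however small its variance, a single chain cannot recover the right path once it is on a wrong one, whereas with K independent samples the probability that at least one is correct is markedly higher. In Table 1 of the paper, M@K=42 vs P@K=67 for R1-Distill-Qwen-7B quantifies this width gain.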

11 · Limitations / Personal Take / Open Questions

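The paper is an empirical study, not a new method. The two-stage heavy thinking paradigm is already used by Kimi K2, PaCoRe, and LongCat-Flash-Thinking. HeavySkill's contributions are (a) systematic naming + a metric framework (HM/HP); (b) compressing it into a portable skill; (c) large-scale cross-model evidence; (d) verifying RLVR compatibility.
Is "heavy thinking" really an inner skill? The title suggests internalization, but in practice it is still prompt-level orchestration: the deliberator just follows the skill document in context. True architectural internalization (having the model weights naturally produce K trajectories) is what Multiverse / ParaThinker pursue. HeavySkill is "in-context internalization" and still depends on a frontier model's instruction-following ability.
Iterative deliberation raises HM but lowers HP: iteration costs exploration. In production N=1 is usually enough; N=2~3 pays off only on very hard problems.
Cost is not seriously discussed. With K=16 thinkers + K⁽¹⁾=4 summaries, total tokens are ~20× a single inference pass. The paper reports positive quality numbers but no Pareto curve against long CoT under a fixed compute budget, which is what real deployments care about most.
The ceiling effect is already visible. Frontier models reach HM@4 ≥ 99 on AIME25/HMMT, which is nearly saturated. HeavySkill's next wave of value lies at the frontier of the frontier (IMO-level, AGI-level benchmarks), not on AIME.
Portability of the skill beyond Claude Code: the paper only sanity-checks two harnesses; broad deployment still depends on whether a given framework supports primitives such as "spawn subagent in parallel".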

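Open Questions

What is the optimal ratio of K to K⁽¹⁾? The paper uses K=8/16 and K⁽¹⁾=4 but gives no ablation curve.
Would replacing the deliberator with a PRM-style step-wise verifier get even closer to P@K?
HeavySkill improves clearly on code generation (LiveCodeBench), but the prompt for code synthesis should differ from the one for math deliberation. Is a general-purpose skill document enough?
After RLVR training with K=8, does the model still hold up when inference switches to K=16? Width generalization is untested.
Explicit diversity intervention across the K trajectories (Max-Diversity in the paper's Appendix A) is reported to be on par with random. Is that because temperature=1.0 is already diverse enough, or because the diversity metric was chosen badly?
Does nesting HeavySkill inside a Kimi-style Agent Swarm (each subagent in the Swarm thinks with HeavySkill) bring further gains? This is the most direct hybrid from an engineering standpoint.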

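Memory Points

Position · HeavySkill: the essence of Agent Swarm (parallel + summary) compressed into a single Markdown skill that the model executes autonomously, without depending on an external orchestration framework.

Mechanism · Two-stage pipeline: π_θ samples K independent trajectories → serialized memory cache (prune + shuffle) → π_φ critique-style deliberation (not majority vote).

Metrics · Performance hierarchy: P@K ≥ HP@K ≥ HM@K ≥ V@K ≥ M@K. HM beating V is evidence that deliberation > voting; HP occasionally exceeding P is evidence that deliberation can synthesize new solutions.

Counterintuitive · The deliberator π_φ does not need strong raw reasoning. Qwen2.5-32B-Instruct scores 12.8% on AIME25 on its own but still yields gains as a deliberator; what the deliberator needs is instruction following + synthesis.

Trade-off · Iterative deliberation raises HM but lowers HP. Iteration converges to a consensus while losing exploration. N=1 is the practical default.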

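RL · Treat (K thinkers + summary) as one trajectory and train it directly with GSPO. On R1-Distill-Qwen-7B, HM@4 gains another ~10% within 100 steps. K=16 hits entropy collapse on 7B (not enough context).

vs Kimi K2.5 · HeavySkill is reasoning-level redundancy (homogeneous K thinkers); Kimi Swarm is task-level division of labor (heterogeneous subagents). The two can be nested and complement each other; they are not substitutes.

vs MiniMax-M1 · M1 pushes depth (80k single chain) to the limit; HeavySkill scales width (K independent chains). HeavySkill's P@K ceiling is looser than what M1's low-variance single chain can reach.

Limitations · (1) "internalization" is at the in-context level, not architectural; (2) the paper gives no Pareto curve under a fixed budget; (3) frontier models already saturate AIME, so harder benchmarks are needed.

Mnemonic · "K thinkers each work alone, one critic reviews them all; re-reason instead of voting, and RLVR lifts both."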