ToolMind Technical Report: A Large-Scale, Reasoning-Enhanced Tool-Use Dataset

Nanbeige Lab, BOSS Zhipin (直聘) + Gaoling School of AI, Renmin University of China · Chen Yang, Ran Le, Yun Xing, Zhenwei An, Zongchao Chen, Wayne Xin Zhao, Yang Song, Tao Zhang · 2025-11-12 (v1) → 2025-12-05 (v2)
arXiv:2511.15718 · HF Dataset: Nanbeige/ToolMind · HF Org: Nanbeige · ToolMind-Web-3B (后续衍生 search agent)
关键词: tool-agentic SFT · function graph · 三角色 multi-agent simulation · turn-level filtering · BFCL-v4 / τ-Bench / τ²-Bench · Qwen3-8B / 14B

速读卡片 (TL;DR)

一句话: ToolMind 是 BOSS 直聘 (Nanbeige Lab) + 人民大学高瓴 Wayne Xin Zhao 组联合产出的 360K 规模 tool-agentic SFT 数据集(163,180 条合成 trajectory + 205,431 条 augmented open-source,合计 368,611,论文叫"约 360k")。整条 pipeline 的方法论新意是: (i) function graph + random walk 从 20K open-source 函数(xLAM-60K + Glaive + ToolACE)里采出 function chain; (ii) 三角色 multi-agent simulator(user / assistant / tool 三个 LM 各扮一角)产 trajectory; (iii) 双层质量过滤 —— trajectory-level 二元 LLM judge + turn-level 二元 LLM judge,后者是核心 novelty,通过 masking 而非 dropping 把 trajectory 内的错误 turn 单独剔除,只在剩余 valid turn 上算 loss。Qwen3-8B SFT 后 BFCL-v4 +4.69 pp (42.21→46.92)、τ-bench +10.87 pp、τ²-bench +11.77 pp;14B 在 τ-bench 上 +14.22 pp (38.78→53.00) 最显著。Apache-2.0 dataset 全开源,但训练好的 ckpt 没发布(后续 ToolMind-Web-3B 是改向 deep-search 的衍生品,不是 paper 主线 SFT model)。

160K + 200K

synthetic + augmented = 368,611 samples

20K+ tools

来自 xLAM/Glaive/ToolACE 三源 function pool

Turn-Level Filtering

论文核心 novelty: mask 单 turn 不丢 trajectory

+14.22 pp

14B 在 τ-bench 上最大涨幅 (38.78 → 53.00)

立场: 这是一篇方法论小新意 + 工程性扎实 + 评测诚实的 technical report,定位是"#22 TOUCAN 的方法论补丁版" —— TOUCAN 在 trajectory 级用 6 维 Likert + GPT-OSS-120B 过滤,ToolMind 在此基础上多加一层每个 assistant turn 单独二元判 + masking 而非丢弃,从而保留"self-corrective reasoning"信号(论文 abstract 原话)。规模上比 TOUCAN 小 4-10× (368K vs 1.5M raw / 119K SFT),但使用了真 MCP server 之外的另一条路线:tool 是 schema,response 由 tool agent LLM 模拟 —— 这点和 ToolACE / Nemotron 同类(论文也明确承认"tools are not executable in our setup")。和 #22 TOUCAN、#23 EnvScaler、#20 SETA、#18 AWM 一起构成 2025-10 → 2026-01 的 tool-agentic SFT 数据集军备竞赛。

1 · Motivation — "trajectory 级过滤会漏掉 turn 级错误"

ToolMind 的 motivation 段落几乎可以一句话总结: "现有 multi-turn 合成数据只验整条 trajectory,turn 内的错误会随训练放大"。原文 abstract 第二句直接点名:

"Most existing works on multi-turn dialogue synthesis validate correctness only at the trajectory level, which may overlook turn-level errors that can propagate during training and degrade model performance." — abstract

这一立场和 #22 TOUCAN 形成直接对比: TOUCAN 用 6 维 Likert (difficulty / uniqueness / quality / realism / verifiable / stability) 在 trajectory 级别过滤,然后整条 trajectory 要么全留要么全丢。ToolMind 提出的反例是: 一条 trajectory 总体上 coherent、目标达成,但中间可能有 1-2 个 turn 出现:

失败中间尝试 (failed intermediate attempts) — assistant 调错工具,被 tool agent 返回错误,然后改正。这种 self-correction 是宝贵信号还是噪声?ToolMind 的答案是: 看具体 turn — 错误 turn 应当 mask 掉,纠正 turn 留下。
基于错误假设的 tool call (tool calls based on incorrect assumptions) — assistant 在 user intent 没说清楚的时候硬填参数。这种 hallucination 一旦被训进模型就是"过早 commit"的坏习惯。
角色漂移 (role drift in user/tool messages) — user agent 偶尔会跳出来给建议,tool agent 偶尔会用自然语言回复而不是 JSON。这种角色一致性破坏无法在 trajectory 级用 "整体连贯吗" 这种 coarse 问题筛掉,但 turn-level judge 能精确指认是哪一条 message 出问题。

论文 §3.3.2 原文:

"Since trajectory-level filtering operates at a coarse granularity, it cannot effectively isolate erroneous turns within otherwise coherent trajectories. Trajectories may still contain erroneous or misaligned turns—such as failed intermediate attempts, tool calls based on incorrect assumptions, or role drift in user/tool messages. To further improve data quality, we address these remaining issues by applying a separate LLM-based judge to each turn. The model then masks invalid or role-inconsistent steps and retains only correct procedural actions, ensuring the final training data targets effective problem solving and precise tool use." — §3.3.2

注意 abstract 里特意加的一句"while preserving self-corrective reasoning signals essential for robust tool-use learning" —— 这是对"全丢"路线的 implicit 批评: 如果用 TOUCAN 风格直接丢掉含错 trajectory,你也丢掉了"错完之后恢复"这个最宝贵的 RLHF-like 信号。turn-level mask 让"恢复 turn 留下"成为可能,而"错 turn 不计 loss"避免训出 hallucinate 习惯。这是论文方法论的核心 selling point。

个人观察: 这个 motivation 是真有内容的 — 但它不是 ToolMind 独有。#29 EnvTuning 的 fine-grained progress reward 也是 per-turn signal,只不过是用在 RL 里;#22 TOUCAN 的 rule-based filter 实际上也部分 per-turn(查 tool-call 顺序)。ToolMind 的真正贡献是把 turn-level LLM judge + masking-not-dropping 形成一个可复现的 SFT pipeline,并配出消融表证明 +2.76 pp on BFCL-v4 (Table 4: (a) 46.87 vs (b) 44.11)。这个 ablation 落差不算特别大,但确实显著且方向一致。

2 · Function Graph 构造 — random walk 采 function chain

这是论文 §3.1 的内容,也是合成数据 diversity 的源头。整个思路用一张图说清楚:

我是 eCom 仓库管理员,先查库存位置,再核 XYZ789 库存,最后检 IP" → 三角色 multi-agent 演出完整 multi-turn 对话(详 §3) ⑥ Quality Filtering (双层) trajectory-level binary judge → turn-level binary judge (mask 单 turn)

图 1: ToolMind 数据合成 pipeline 全景。本文图 2 的中文重绘 — 论文流水线核心是 ① 收集 20K 函数 → ② 把每个 input/output 参数都 embed → ③ 用 max cosine similarity + LM validator 建有向图 → ④ random walk 采 5-20 步的 function chain → ⑤ chain 喂给 user-agent 编 intent + 三角色 multi-agent 演 trajectory → ⑥ 双层 quality filter。

2.1 函数集合与预处理

函数池不来自 MCP,而是从三个 open-source dataset 抠出 schema:

xLAM-function-calling-60k (Salesforce, CC-BY-4.0)
Glaive-function-calling-v2 (GlaiveAI, Apache-2.0)
ToolACE (华为,Apache-2.0;校订 2026-05-18: 早期标 NC 是错误,实测 HF cardData.license=apache-2.0,见 #35 ToolACE)

对每个函数 f_i: X_i → Y_i,如果输入 / 输出参数缺 description 或 type,就用 LM 推断补全(论文 Appendix A.1 给了完整 prompt: "Return only one result in JSON format with two fields: input description and input structure...")。

2.2 参数级 embedding + 边构造

对每个参数 r ∈ X_i ∪ Y_i,用统一 embedding model 编码:

v(r) = ϕ( DESC ‖ desc(r) ‖ TYPE ‖ type(r) ) ∈ ℝ^d

然后对每对函数 (f_i, f_j),计算最大输出-输入 cosine:

s_ij = max _{y∈Yᵢ, x∈Xⱼ} sim( v(y), v(x) )

超过阈值 τ 的就建一条有向边 i → j,再用 LM 当 validator(论文 Appendix A.1 又给了一段 prompt: "Please determine whether ... the output result of source_function is suitable as the input parameter of target_function. ... assign a score from 0 to 9 for the following aspects: field transitivity, coherence of potential user intent paths.")二次过滤边的可靠性。最后再注入一点随机性"to introduce randomness into edge construction" — 防止 graph 退化成只连超热门 hub。

2.3 random walk 采 chain

从 graph 上做长度 5-20 的有 per-node visit limit 的 random walk:

W = (f₀, f₁, …, f_L), L ~ Uniform{5, …, 20}

per-node visit limit 防止"oversampling specific nodes"。每条 walk 就是一个 function chain,后面被 user-agent 翻译成"我是仓库管理员,先做 X 再做 Y 最后做 Z" 这种 user intent。

个人观察: graph 方法是 ToolMind 最优雅的一点 — 它把"多 tool chain 的语义连贯性"从"让 LLM 自己想"(像 TOUCAN 的 Mistral-S/DevStral-S 出题)变成了"从一张 parameter-similarity graph 上做 random walk"。优点: chain 一定有 "前一步输出可以喂下一步输入" 的真语义依赖(被 LM validator 双重检查过);缺点: 这种依赖只是"type/desc 相似",不是"值真的能传" — 实际 trajectory 里 tool 是 simulated 的,所以这种 type-level fit 够用。但和真 MCP server 路线(TOUCAN) 比,你不会撞到"输出 schema 是 dict[user_id] 但下游需要 str[email]"这种现实不匹配。

3 · Multi-Agent Synthesis Framework — user / assistant / tool 三角色

这一节(论文 §3.2)是方法论上"小创新但工程实在"的部分。注意一点: tools 不可执行。原文:

"Since the tools are not executable in our setup, we employ a simulation framework to generate responses." — §3.2

因此整个 trajectory 由三个 LM agent 协同演出:

角色	职责	Appendix A.2 关键 prompt 片段
User Agent	基于 function chain 编出来的 intent 提问、推进对话、出现 ###STOP### 时结束	"Do not reveal your entire intent at once. Unfold your needs gradually over multiple turns. Use common human conversation strategies, such as showing uncertainty when appropriate."
Assistant Agent	回复 / 索澄清 / 调用工具(这是要训的角色 — 也是 SFT loss 加在的地方)	(§3.2: "Responds to the user's queries, provides guidance, asks for clarification, or calls external functions when needed.")
Tool Agent	模拟 server 执行结果,返回 JSON in `<func_return>` tag	"You are simulating a high-performance computer system with complete computational capabilities. ... For a given function, you should simulate the execution of a computer system program as accurately as possible. ... If the function call is invalid (e.g., incorrect function name, missing or malformed arguments), return a JSON error message clearly indicating the cause."

三个 agent 在 loop 里走: user → assistant (可能 tool_call) → tool → assistant → … 直到 user agent 判定 intent 达成,输出 ###STOP###(注意 prompt 严格规定 "Do not output '###STOP###' in the first turn, regardless of the conversation!!" — 防止过早收尾)。

3.1 关键设计取舍

"intent ≠ query" — user agent 拿到的是 underlying intent,不是直白 query。论文原话: "A user intent represents the underlying goal that a user aims to achieve, rather than the exact query they pose. This distinction provides greater flexibility in modeling user behavior, as users often express the same intent in diverse ways." 这给了 user agent 重新措辞的空间,产 dialog 更自然。
incremental disclosure — user 不一次性把 intent 全说完,被 prompt 强制"show uncertainty when appropriate"。这是 ToolMind motivation 1.5 那段说的"user requests are often under-specified"对应解法。所以训练数据里会有大量 assistant 主动 clarify 的 turn — 这是 BFCL-v4 multi-turn / agentic 子项的 critical 能力。
tool agent 故意会模拟 error — prompt 明确说 "If the function call is invalid (...), return a JSON error message"。这是论文有意让"error handling + recovery"信号进入数据。
具体哪些 LM 当 agent 论文没写 — Appendix B 给了一条完整 trajectory 示例(15 页 PDF 的最后 3 页),从 thinking style 看像是 Qwen 系。未在公开来源确认

Appendix B trajectory 示例片段 (eCom 仓库管理员 intent)

function chain: getStockLocations → checkInventory → checkIpAddress

User intent: "I'm a warehouse manager for an eCommerce company. First, I need to check
where our inventory is stored by retrieving the stock locations. Then, I want to verify
the availability of product 'XYZ789' specifically at 'warehouse C' to plan restocking.
Finally, I need to ensure our system's security by checking if the IP address
192.168.1.100 is safe."

user: Can you retrieve the current stock locations for me?
assistant: <think>Okay, the user is asking to retrieve the current stock locations. ...
The user didn't specify a limit, so I don't need to include that. Since the function
doesn't require any parameters, I can call it without any arguments. ...</think>
<tool_call>{"name":"getStockLocations","arguments":"{}"}</tool_call>
tool: {"stock_locations":[{"id":1,"name":"Warehouse A",...},{"id":3,"name":"Distribution
Center C","address":"789 Pine Rd, Countryside","capacity":10000}],"pagination":{...}}
assistant: <think>... I should present this information clearly. ... </think>
Here are the current stock locations retrieved: 1. **Warehouse A** ... 2. **Store B** ...
3. **Distribution Center C** - Address: 789 Pine Rd, Countryside - Capacity: 10,000 units
...
user: Can you check the availability of product XYZ789 at Distribution Center C?
assistant: <tool_call>{"name":"checkInventory","arguments":{"product_code":"XYZ789"}}</tool_call>
tool: {"product_code":"XYZ789","location":"Distribution Center C","available_quantity":150,
       "last_updated":"2023-10-05T14:30:00Z"}
...
user: ###STOP###

注意 assistant 风格: 显式 <think> ... </think> + <tool_call>{...}</tool_call> — 这是 Qwen3 / Kimi-K2 / GLM-4.5 系的 chat template。论文 §4.1 也明确说: "Each asisstant response is structured into three parts: think, content, and tool calls. During training, we compute loss only on the tokens of the anchored assistant response." — 所以 ToolMind 是把 think trace 一并训进 SFT 的,这就解释了 paper 名字里的 "Reasoning-Enhanced"。

4 · Turn-Level Filtering — 论文核心 novelty

论文 §3.3 把过滤分成两个 stage,prompt 也在 Appendix A.3 列出来了。这两个 prompt 长得几乎一样,但判定单位不同 —— 这正是论文的 key insight。

4.1 Trajectory-level filtering (Stage 1)

给完整 multi-turn 对话,LM 二元判 0/1:

"Please strictly evaluate the quality of the following multi-turn dialogue data based on the following criteria: contextual coherence, role consistency, logical soundness, and accuracy of tool usage. Your task is to make a binary judgment — if the dialogue is of good quality, output 1; otherwise, output 0." — Appendix A.3.1

另外还有第一道检验: "the user agent first judges whether the task is completed or not during the process of multi-turn interaction" — user agent 自己判定 intent 达成,达成才进 stage 1。论文原话:

"Given the predefined user intents, the user agent first judges whether the task is completed or not during the process of multi-turn interaction. Then, a second LLM evaluates goal coverage, coherence, and cross-turn relevance. Only trajectories that both align with the intent and exhibit a coherent flow are retained, while the rest are discarded." — §3.3.1

4.2 Turn-level filtering (Stage 2 — 核心 novelty)

给完整 conversation history + 最后一条 response,LM 二元判 0/1:

"Please strictly evaluate the quality of the last response in the following dialogue data, based on contextual coherence, logical consistency, and accuracy of tool usage. Determine whether the response is semantically aligned with the previous dialogue, logically sound without contradictions, and employs the mentioned tools correctly according to their definitions and argument structures. If the response is of good quality, output 1; otherwise, output 0." — Appendix A.3.2

不同于 trajectory filter 的丢弃行为,turn filter 走的是 mask 路线: 标记为 0 的 turn 不参与 loss(论文 §3.3.2 末尾的"masks invalid or role-inconsistent steps")。然后再做按 turn 切样本: §4.1 原话 "each multi-turn conversation is split at each assistant message into multiple samples. For each splitted sample, we retain the full prior context up to (and including) the anchor assistant message and discard all subsequent turns. ... During training, we compute loss only on the tokens of the anchored assistant response."

所以一条 N-turn trajectory 最终会变成 N 个样本(每条样本 anchor 在不同的 assistant turn 上),但只有 turn-level filter 判 1 的那些 anchor 会进入最终训练集。这就是 §3.4 那段:

"We split each trajectory into multiple samples using the turns that passed the turn-level quality filter and analyze both trajectories (orange) and post-split samples (blue)." — §3.4

4.3 为什么这么做 — Table 4 消融

论文 Table 4 给了直接证据 — 用 Qwen3-8B 做消融:

配置	τ-bench Avg	τ²-bench Avg	BFCL-v4 Overall
Qwen3-8B baseline	35.83	34.64	42.21
(a) only synthesized data 带 turn-level filter	42.31	38.85	46.87
(b) only synthesized 不带 turn-level filter	35.31	41.73	44.11
(c) only augmented open-source	48.65	42.16	45.88
full ToolMind = (a) + (c)	46.70	46.41	46.92

关键差: (a) − (b) 才是"turn-level filter 净效应":

τ-bench: +7.00 pp (42.31 vs 35.31) — 这是论文 turn-level 的最强证据
τ²-bench: −2.88 pp (38.85 vs 41.73) — 反方向,turn-level 反而轻微伤
BFCL-v4: +2.76 pp (46.87 vs 44.11)

诚实诊断: turn-level filtering 在 τ-bench / BFCL-v4 上确实有用,但在 τ²-bench 上反而拖了 2.88 pp。论文没有解释这个反例。我的猜测: τ²-bench 是双侧 control(user 也有工具)的环境,你需要保留更多 "错完之后纠错" 的 turn 才能学到 dual-control coordination — 而 turn-level filter 太严的话可能把这些"必要的中间错误"也 mask 掉。这意味着 turn-level filter 的"strict binary"设计在某些场景下是过紧的 — 比 trajectory-level filter 的 6-维 Likert 损失更多 nuance。

5 · Dataset Composition — 360K 的内部精确拆分

论文 Table 1 给了非常清楚的拆分。注意名字虽然叫 "160K + 200K" (= ~360K),但实际 process 后的精确数字是:

数据源	原始 trajectory / query	process 后 sample 数	license / 来源链接
xLAM-function-calling-60k	60,000	101,363	Salesforce, CC-BY-4.0 (注意 xLAM 系列原版 60K 是 CC-BY,部分文献误标为 NC)
When2Call (SFT split)	15,000	17,531	NVIDIA, CC-BY-4.0 (NAACL 2025 Findings)
Glaive-function-calling-v2	112,960	20,017	GlaiveAI, Apache-2.0 — 被切掉 82.3% (大量 multi-turn 没过 filter)
ToolACE	11,300	7,327	华为, CC-BY-NC-4.0 — 非商用来源
BUTTONInstruct	8,000	21,202	PKU + 百川,arXiv:2410.12952
APIGen-MT-5k	5,000	25,109	Salesforce,arXiv:2504.03601
τ-bench train set	—	12,882	Sierra Research, MIT
合计 augmented open-source = 205,431
Synthesized trajectories (graph + multi-agent)	40,161	163,180	本论文新合成
ToolMind Total		368,611	composite license (取最严: CC-BY-NC-4.0)

license 校订 (2026-05-18): 早期版本担心 ToolMind 含 ToolACE → 继承 NC,这个判断错了。实测 Team-ACE/ToolACE HF cardData.license = apache-2.0(curl 验证),见 #35 ToolACE。所以 ToolMind 的 7 个来源数据集里,真正的 NC 风险来自 xLAM-60K 的姐妹 xLAM-function-calling-60k(实际是 CC-BY-4.0 BY-only,不是 NC),其他来源 (Glaive Apache / ToolACE Apache / APIGen-MT Apache / BUTTON Apache / When2Call Apache / τ-bench MIT) 都干净。ToolMind 完整版可商用,license 上和 TOUCAN 同档。

5.1 Synthesized 部分: 40,161 → 163,180

一条 raw trajectory 平均切成 ~4.06 个 anchor sample。考虑到论文规定 L (function chain 长度) ∼ Uniform{5..20},每条 chain 演出来的对话至少 5 轮 user-assistant 来回,所以平均 ~4 anchor sample 是合理的(每条 chain 平均剩下 ~4 个通过 turn-level filter 的 assistant turn)。

5.2 函数总量 = "20K+"

论文 abstract / §1 反复说 "20k functions from open-source datasets"。这个总量是 xLAM-60K (~60K query 但只覆盖 ~3K function) + Glaive-v2 + ToolACE 三个并集去重后的 ~20,000 个独立 function schema(具体精确数字论文没给)。未在公开来源确认 — HF dataset card 也只是说"20k tools"概数。

5.3 Domain 分布

论文 Figure 4 给了 user intent 的 domain 饼图。前两大类是:

data analysis — 占比最大
entertainment — 第二大
其它(finance / weather / e-commerce / travel / health / tech / sports / education / ...) — "relatively balanced"

论文原话: "the data analysis and entertainment domains are the most prevalent, while the remaining domains are relatively balanced." — 这是来源数据集分布的直接反映(xLAM 大量是 data analysis,Glaive 大量是 entertainment / lifestyle)。

5.4 Turn / tool-call 统计

论文 Figure 3 用 orange (trajectory level) 和 blue (post-split sample level) 两套直方图展示。论文 §3.4 给的定性观察:

trajectory-level 长度分布大致 uniform("relatively uniform"),user 参与度强(每条 trajectory 里都有 user message 反复出现)
post-split sample 分布右偏(短样本多)("the distribution skews toward shorter instances") — 这是 split anchor 后必然的
"sizable fraction of turns proceed without any tool calls" — 这是 user agent 被 prompt "incremental disclosure"逼出来的副产品: 很多 assistant turn 是 clarify-asking 而不是 tool-calling
论文自己承认: "current tasks leave room for greater complexity, motivating future work on more demanding tool-use scenarios" — 承认当前 trajectory 偏简单(5-20 步 chain 不等于 5-20 个 tool call;真正 tool call 数远小于此)

6 · Training + Evaluation

6.1 训练配置 (§4.1)

项目	设定
base model	Qwen3-8B 和 Qwen3-14B(注意不是 Qwen2.5,也不是 Qwen3-32B/235B — paper 主线就这两个尺寸)
训练框架	OpenRLHF(arXiv:2405.11143)
sequence length	64K
global batch size	64
learning rate	5e-6
warmup ratio	0.03
loss 计算	仅 anchored assistant response 的 token(think + content + tool_calls 三段都算)

6.2 评测 (§4.2)

三个 benchmark,各自定位:

BFCL-v4(2025-07-17 发布的最新版) — 用 BFCL 官方 Qwen3 setup,leaderboard 截止 2025-10
τ-bench (arXiv:2406.12045, NeurIPS 2024) — Sierra Research,retail + airline 两域
τ²-bench (arXiv:2506.07982) — τ 的 dual-control 扩展,加 telecom 域;用 GPT-4o 当 user simulator

6.3 BFCL-v4 主结果 (Table 2)

论文 Table 2 同时报了 leaderboard 上的大模型(上半部分)和 ToolMind 自己 SFT 的 8B/14B(下半部分):

Model	Overall	Single Non-Live AST	Single Live AST	Multi Turn	Agentic Search	Agentic Memory
DeepSeek-V3 (FC)	45.20	88.77	79.94	33.00	32.50	22.37
DeepSeek-R1-0528 (FC)	48.97	75.73	80.90	44.50	63.00	0.00
Qwen3-235-Instruct (FC)	54.37	88.10	82.61	44.50	49.00	29.25
Kimi-K2-Instruct (FC)	56.07	84.02	77.57	48.75	59.00	25.16
GPT-4o-2024-11-20 (FC)	50.27	83.88	70.54	42.50	40.50	28.82
GPT5-2025-0807 (FC)	59.22	72.92	58.25	28.50	84.50	57.63
Gemini-2.5-Pro (Prompt)	54.14	89.54	76.83	30.62	66.50	31.61
Qwen3-8B (baseline)	42.21	88.27	80.83	38.88	10.00	18.71
↳ with ToolMind	46.92 (+4.69)	88.06	81.42	46.62 (+7.74)	21.50 (+11.50)	20.43
Qwen3-14B (baseline)	45.14	90.10	80.90	44.12	12.50	21.29
↳ with ToolMind	50.54 (+5.40)	89.00	80.83	51.00 (+6.88)	35.50 (+23.00)	17.85 (−3.44)

读法:

Multi-Turn 和 Agentic Search 涨幅最大 — 14B 在 Agentic Search 上 +23.00 pp,这是 ToolMind 设计的方向 (multi-turn dialogue + tool use)。
Single-Turn 几乎不动(8B Non-Live AST 反退 -0.21 pp, 14B 退 -1.10 pp) — 因为 baseline Qwen3 在 Single-Turn 已经接近天花板(88-90),SFT 没空间。
Agentic Memory 14B 反退 -3.44 pp — ToolMind 没专门覆盖 memory_kv / vector / rec_sum 那几个 V4 子项,这是个明显的暴露面。
14B + ToolMind 50.54 超过 DeepSeek-V3 (45.20)、DeepSeek-R1 (48.97)、GPT-4o (50.27),但仍输 Qwen3-235 (54.37) 和 GPT-5 (59.22) — 数据增强的效果显著但封顶在 ~50 左右。

6.4 τ-bench + τ²-bench (Table 3)

Model	τ-bench Avg	τ retail	τ airline	τ² Avg	τ² retail	τ² airline	τ² telecom
Qwen3-8B (baseline)	35.83	35.65	36.00	34.64	43.86	32.00	28.07
↳ with ToolMind	46.70 (+10.87)	57.39 (+21.74)	36.00	46.41 (+11.77)	59.65 (+15.79)	48.00 (+16.00)	31.58 (+3.51)
Qwen3-14B (baseline)	38.78	49.56	28.00	40.65	52.63	36.00	33.33
↳ with ToolMind	53.00 (+14.22)	60.00 (+10.44)	46.00 (+18.00)	49.08 (+8.43)	59.65 (+7.02)	56.00 (+20.00)	31.58 (−1.75)

读法:

τ-bench retail 域 14B 60.00, 8B 57.39 — 这是 ToolMind 最强的成绩点。retail 域和 ToolMind 大量 data-analysis / e-commerce intent 直接 distribution match。
τ²-bench airline 域 14B +20.00 pp(28.00 → 46.00 不对,应是 36.00 → 56.00,这是 14B 那行,从 baseline 36.00 涨到 56.00)— 涨幅惊人。
τ²-bench telecom 14B 反退 −1.75 pp — 和 #22 TOUCAN 一样的弱项,telecom 域涉及 dual-control 的特殊指令,纯 SFT 数据合成路线很难覆盖。

6.5 Ablation (Table 4 — 已在 §4 引用)

已经在 §4.3 里展开。这里只回顾结论:

开源 augmented 数据贡献最大(τ/τ² 涨幅大),graph-based synthesized 数据贡献集中在 BFCL-v4
turn-level filtering 在 τ-bench (+7.00) 和 BFCL-v4 (+2.76) 上稳定 +,在 τ²-bench 上 −2.88(反例,需注意)
full ToolMind 几乎在所有指标上 ≥ 单组件,验证组合有效

7 · 与已有笔记的横向对比

7.1 与 #22 TOUCAN 的对比 — 最重要的一张表

维度	TOUCAN	ToolMind
规模	1.5M raw → 119K SFT subset	368K (160K synth + 200K augmented)
tool 来源	495 真实 MCP server + 2,000+ tools(HTTP 可联通)	~20K function schema(不可执行)
tool response	真 server 真执行	tool agent LM 模拟(类 ToolACE)
合成方式	5 LLM 出题 × 3 策略 → 3 teacher × 2 framework 生 trajectory	function graph + random walk 采 chain → 三角色 multi-agent 模拟
多样性源头	MCP 生态本身的天然多样性 + 多 teacher 风格	function graph 的 walk 多样性 + per-node visit limit
过滤策略	规则(tool-call 顺序)+ 6 维 Likert trajectory 级	trajectory 二元 + turn 二元(2-stage, mask 不丢)
过滤粒度	trajectory level only	turn level + masking(核心新意)
self-correction 保留	差(整 traj drop)	好(只 mask 错 turn)
base model	Qwen2.5-7B/14B/32B	Qwen3-8B/14B
SOTA 主指标	BFCL-v3 32B 70.45 险胜 GPT-4.5 (70.32)	BFCL-v4 14B 50.54;τ-bench 14B 53.00
license — dataset	Apache-2.0(干净)	Apache-2.0(校订: 早期担心 ToolACE NC 传染是错的,实测 ToolACE Apache,见 #35)
开源 ckpt	27 个(3 官方 + 24 衍生)	paper 没发(`ToolMind-Web-3B` 是后续不同方向)
哲学差异	"规模 + 真" 双堆	"结构化采样 + 细粒度过滤" 双精

什么时候选哪个:

选 TOUCAN 如果: 你需要真 MCP server 行为(rate-limit / auth-fail / stateful 多 call);你要做商用且 license 不能含 NC 成分;你要 32B 大 SFT 来打 BFCL-v3。
选 ToolMind 如果: 你优先 multi-turn dialogue clarification 能力(被 incremental-disclosure prompt 训出来);你要新一代 Qwen3 base;你愿意从 dataset card 自己去掉 ToolACE 子集来保证 license 干净;你想沿用 turn-level filter 思路自己再做 RL stage。
组合: 冷启用 TOUCAN,精调用 ToolMind turn-level 思路自己过滤一遍 — 这是个实操合理路线。

7.2 与其它已有笔记的关系

笔记	关系
#18 AWM (Snowflake)	同样是"tool 不可执行用 simulator"路线,但 AWM 用 SQLite/Python 当确定性 backend(状态可查 / 可重置),ToolMind 用 LM 当 tool agent — AWM 验证更严,ToolMind 实现更简。
#20 SETA	SETA 是 RL 路线,400 个 Docker 终端任务;ToolMind 是 SFT 路线,360K 合成对话。互补,不竞争。SETA + ToolMind 串训(SFT 冷启 + RL 在线)是合理 recipe。
#23 EnvScaler	EnvScaler 把 env 写成 Python class(类似 AWM 思路),ToolMind 把 env 写成 LM tool agent。都解决 "tool 不可执行" 问题但走不同路。EnvScaler dual-agent 100 轮质量门 + Reinforce++ vs ToolMind turn-level mask + SFT,前者重 RL 后者重 SFT。
#28 BFCL	ToolMind 的主评测床。注意 ToolMind 用的是 v4(2025-07-17 发布),不是 TOUCAN 用的 v3。V4 重排了权重(Agentic 40% + Multi-Turn 30% + Live 10% + Non-Live 10% + Irrelevance 10%),所以两篇 paper 数字不直接可比。
#29 EnvTuning (ICLR 2026)	EnvTuning 的 Fine-grained Progress Reward 也是 per-turn 信号,但用在 RL 阶段(每步给 r^state + r^exec);ToolMind 的 turn-level filter 是 SFT 阶段的二元 mask。两者都是 "per-turn signal is critical" 这条直觉的不同实现。
#30 ACEBench (EMNLP 2025 Findings)	ToolMind 没在 ACEBench 上评。但 ACEBench 的 Multi-Step / Multi-Turn 子项和 ToolMind 训练目标完全 align — 在 ACEBench 上跑一次能更准确判断 ToolMind 的能力边界,paper 本身没做。
#31 Prime Intellect Hub	Prime 的 verifiers SDK 提供 SingleTurnEnv / MultiTurnEnv / ToolEnv,ToolMind 数据可以直接 wrap 成一个 verifiers env 上传 Prime Hub 给 community 复用 — paper 没这么做,但很自然的下游。
#26 MCP Bench code dive	ToolMind 数据风格(LM 模拟 tool response)和 MCP-Universe / MCPMark 的真 server 验证方向不兼容 — ToolMind 上训的模型迁到真 MCP 会暴露 "模拟 tool response 风格 → 真实 server response schema" 的 distribution gap。

8 · 开源现状清点

ToolMind dataset

项目	URL / 状态
HF dataset	huggingface.co/datasets/Nanbeige/ToolMind
license	Apache-2.0(HF 自标,7 个来源数据集逐一验证 Apache/MIT/CC-BY,无 NC 传染。2026-05-18 校订: 早期 ToolACE NC 担忧已撤销,见 #35)
下载量 (截止 2026-05)	2,759 次下载 / 147 likes
created	2025-10-21
last modified	2026-01-09
文件组织	两个 split: `graph_syn_datasets/graphsyn.jsonl` (合成部分) + `open_datasets/*.jsonl` (7 个 augmented 来源各一)
总存储	7.87 GB
arXiv paper	2511.15718 (v1: 2025-11-12 → v2: 2025-12-05)

Synthesis 代码 / 训练代码

未公开 — 论文 abstract 只给了 dataset 链接,没有 code repo;HF dataset card 也没有 GitHub 链接。GitHub 上搜 "ToolMind Nanbeige" 无结果。这意味着 turn-level filter 的具体实现(用什么 LM 当 judge、阈值怎么定、prompt 是否完整就是 Appendix A 那段)只能照 Appendix 自己复现。

Fine-tuned checkpoints

Paper 主线 Qwen3-8B-ToolMind / Qwen3-14B-ToolMind 没有发布。Nanbeige HF org 上倒是有后续衍生的 search agent:

Nanbeige/ToolMind-Web-3B — 1,622 下载 / 24 likes,Apache-2.0,base 是 Nanbeige4-3B-Thinking-2511(LlamaForCausalLM)而不是 Qwen3;走的是 SFT + RL with turn-level judge / turn-level reward(同源思路);专门针对 deep-search(评测 GAIA / Xbench-Deepsearch / HLE / BrowseComp);非 paper 主线产品。
mradermacher/ToolMind-web-3B-GGUF + i1-GGUF — 量化版
Esther22/qwen3-4b-toolmind-10k-3epochs — 社区独立用 ToolMind 训 Qwen3-4B 的实验,10K 子集,3 epoch

Nanbeige 团队背景

Nanbeige Lab 是BOSS 直聘 (kanzhun.com) 的 AI 实验室,联合通讯作者 Yang Song (songyang@kanzhun.com)。HF 上历史模型:

Nanbeige4.1-3B — 212K 下载 / 1106 likes(2026-04 起爆款的小模型)
Nanbeige-16B-Chat / Base 系列(TheBloke 在 2023 帮量化过)

另一通讯作者 Wayne Xin Zhao (batmanfly@gmail.com) 是人民大学高瓴 AI 学院正教授,他主导的 RUCAIBox 也产出了 #23 EnvScaler (同 lab) — 这两篇是同一研究方向群的两个不同切片。

9 · Critical analysis — 局限 / 个人 take

9.1 论文自报的局限

论文 §3.4 末尾自己承认: "these findings indicate that current tasks leave room for greater complexity, motivating future work on more demanding tool-use scenarios." — 承认当前 trajectory 偏简单。后续 ToolMind-Web-3B 的 RL 训也是对这个的响应。

9.2 我观察到的几个 untouched 问题

turn-level judge 自己是用什么 LM 跑的,论文没说。未在公开来源确认
- 如果是 self-judge(用 base Qwen3 自己判),那就有 self-reward bias: judge LM 喜欢的 turn 不一定是真好的,只是合 judge 自己的偏好,SFT 后等于把 judge 偏好烙进 student。
- 如果是更强的 LM(GPT-4o / Kimi-K2)判,那需要额外明示;Appendix A.3 的 prompt 是 LM-agnostic 的,看不出来。
- 论文 Table 2 列了 GPT-4o-2024-11-20 作为 baseline 对比,且 τ²-bench 评测里用 GPT-4o 当 user simulator — 一个合理猜测是判 turn 也用 GPT-4o。但论文从没说出来。
turn-level binary judge 是 0/1,信息量很低。这和 trajectory level 用 6 维 Likert 的 TOUCAN 对比:
- binary 简单,但任何"不完美但有价值"的 turn 一刀切,这正是 §4.3 提到的 τ²-bench 反例可能成因。
- 更精细的设计应该是 per-turn weight ∈ [0,1] 或者 per-turn label among {good / recoverable / bad},但 ToolMind 没做。
tool 是 LM 模拟的,不是真服务。这是和 TOUCAN 的根本 philosophy 差异:
- 优点: 不依赖外部 server,合成可大规模并行,无 rate-limit。
- 缺点: tool agent 模拟 error 的 distribution 是 LM 想象的 error,真 server 的 error mode(timeout / partial response / silent data corruption / cross-call state inconsistency)无法演出。
- 这导致 ToolMind 训出来的模型在真 MCP server 上的 OOD 行为没保证。论文也没在 MCP-Universe / MCPMark / MCP-Atlas 任何一个真 server 评测上跑。
"20K tools" 实际是 schema 不是 callable endpoint。这件事 ToolMind 的描述方式("over 20k tools")容易让读者误以为是 MCP-Universe 那种真 callable 数量。纠正: 这里的 "tools" 等价于 "function schemas drawn from xLAM/Glaive/ToolACE"。
license — 早期判断已撤销 (2026-05-18 校订)。
- 之前担心 ToolACE 是 CC-BY-NC → ToolMind 继承 NC,但实测 Team-ACE/ToolACE HF cardData = Apache-2.0(见 #35)。整个 ToolMind 来源链 license 干净。
- HF dataset 标 Apache-2.0 是准确的。
- 对比 TOUCAN: 两者 license 同档,选择应基于方法学差异(turn-level vs trajectory-level)而非合规。
没发 paper 主线 ckpt 是个反常信号。Nanbeige 后来发了 ToolMind-Web-3B 这种转向 deep-search 的衍生品,但论文里训出来的 Qwen3-8B-ToolMind / Qwen3-14B-ToolMind 至今没在 HF 上出现。可能原因:
- (a) 内部产品集成了,不方便单独发(BOSS 直聘业务相关)
- (b) license 问题(混了 NC 数据训出来的 ckpt 发不发涉及合规)
- (c) 团队转向了新方向,旧 ckpt 维护成本高
- 对比 TOUCAN 27 个 ckpt 全发,ToolMind 的开源 commitment 在数据层是好的,在模型层缺位。
没评 Agentic Memory 的几个 V4 子项(memory_kv / vector / rec_sum)— Qwen3-14B + ToolMind 在 Agentic Memory 上反退 −3.44 pp 是个明显告警:ToolMind 数据没有覆盖 long-term memory / KV-store 操作场景,SFT 后反而干扰了原模型的 memory 处理能力(catastrophic forgetting 的一种)。这是个具体可改进方向。

9.3 我会给的实操建议

"我要训一个 ≤14B 的 tool-use SFT 模型,选哪个 dataset"

三阶段叠加:

冷启 SFT 用 TOUCAN 119K(真 MCP server response,license 干净)— 学到"真 server 错误处理 + tool chain 真依赖"
能力补丁 SFT 用 ToolMind 的 graph_syn_datasets 子集(163,180 sample,跳过 NC 来源)— 学到"graph-walk 出来的 multi-step chain + incremental clarification"
RL stage 接 SETA Docker 400 task 或 EnvTuning BFCL-V3 400 sample,做 RLVR 上一层 — 这是 ToolMind 自己没做的部分

预期上限: BFCL-v4 Overall ~55,τ-bench ~55,τ²-bench ~50 — 这是当前 ≤14B 开源能堆到的天花板。再上就要 RL 在真 MCP 上(MCP-Universe / MCPMark)。

"我只想做 turn-level filter 复现"

看 Appendix A.3.2,prompt 已经完整给了。把任意 multi-turn dataset(TOUCAN / xLAM / Glaive)按 assistant turn 切开,每个 anchor 让 GPT-4o-2024-11-20(或 Qwen3-235-Instruct 当 self-host judge)二元判一次,过 1 的 anchor 保留,过 0 的 mask。注意: judge 的 LM 不要和 student 同型号,以减少 self-reward bias。

9.4 一句话总结

ToolMind 是 "工具数据集应该 turn-by-turn 过滤而不是 trajectory-by-trajectory 过滤" 这个 small-but-real method insight 的扎实实现,加上一个清楚的 function-graph 多样性源头,产生一个 ~360K 规模 Apache-2.0(名义上)的 dataset,在 Qwen3-8B/14B 上分别 +4.69 / +5.40 BFCL-v4 Overall pp、+10.87 / +14.22 τ-bench pp。规模上输 TOUCAN,真实度上输真 MCP 路线,但方法论 (turn-level mask) 是有 transferable insight 的 — 任何后续 tool-agentic 数据集都应该考虑加这一层。

读完笔记后建议: ① 看 #22 TOUCAN 把 trajectory-level baseline 对照清楚;② 看 #28 BFCL 把 v3→v4 改版搞清楚;③ 翻 Appendix A.3.2 真把 turn-level prompt 抄出来跑一次。如果要做 RL 续训,跳到 #29 EnvTuning 看 per-turn reward 的 RL 版本。