Survey · Side-by-Side Comparison of 4 MCP Benchmarks

MCP-Universe / MCP-Atlas / Toolathlon / MCPMark

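Survey · data snapshot 2026-05-15 · GitHub stars and leaderboards fetched live from the official sites and the GitHub API · citing the official system cards of frontier models
Related notes: #19 MCP-Atlas deep read · #15 PinchBench (style reference) · #16 desktop agent landscape
Keywords: MCP benchmark · tool-use eval · leaderboard · degree of openness · RL gym · frontier model adoption · submission process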

Quick-read card (TL;DR)

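In one sentence: of the "MCP benchmarks" that emerged between the second half of 2025 and the first half of 2026, only 4 merit serious discussion: MCP-Universe (Salesforce, 231 tasks, 6 domains, academic reputation plus the most GitHub stars), MCP-Atlas (Scale AI, 1,000 tasks, 36 servers, the only one adopted by all three frontier labs), Toolathlon (HKUST-NLP, 108 tasks / 32 apps / 604 tools, accompanied by the separate 503-task Toolathlon-Gym RL training environment), and MCPMark (eval-sys, 127 tasks / 5 services, cited in the DeepSeek V3.2 technical report, thorough pass@k reporting). Their methodologies differ little; the real divergence is on three points: (a) adoption in frontier model cards; (b) whether an RL gym is included; (c) self-hosting cost. MCP-Atlas wins (a), Toolathlon alone wins (b), and MCPMark wins (c). MCP-Universe is the general-purpose baseline, but no frontier lab has adopted it.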

585 / 417 / 356 / 81

GitHub stars (Universe / MCPMark / Toolathlon / Atlas, 2026-05-15)

Atlas only

Appears in the model cards of all three: Claude Opus 4.7 + GPT-5.5 + Gemini 3.1 Pro

Toolathlon-Gym

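503 tasks · local PostgreSQL · the only MCP bench with an RL training environment (maintained by eigent-ai)

<40B open-source

The <40B region is empty on almost every bench; the main battleground is still 100B+ MoE

Core take: if you want to be "seen by frontier labs", run MCP-Atlas; if you want to train a tool-use agent, use Toolathlon-Gym (and report the main Toolathlon score); if you want to be up and running in 5 minutes to validate prompt engineering, use MCPMark; if you need a baseline for an academic paper, MCP-Universe's 6-domain coverage most resembles a traditional benchmark. The 4 are not mutually exclusive; **the best practice is to report both Atlas and Toolathlon** (covering the "frontier citation" and "open-source RL training" tracks respectively).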

§1 What these 4 are: an overview table

Bench	Team	arXiv / date	Scale	One-line positioning
MCP-Universe	Salesforce AI Research	2508.14704 · 2025-08	231 tasks / 11 servers / 6 domains	Academic-style multi-domain general benchmark with a dashboard and workflow framework; most GitHub stars
MCP-Atlas	Scale AI + NUS	2602.00933 · 2026-01	1,000 tasks (500 public + 500 hold-out) / 36 servers / 220 tools	Largest in scale, claims-based judge, live leaderboard; officially adopted by all three frontier labs
Toolathlon	HKUST-NLP	2510.25726 · 2025-10	108 tasks / 32 apps / 604 tools / ~20 turns on average	Long-horizon multi-app orchestration; accompanied by Toolathlon-Gym, a 503-task local PostgreSQL training environment
MCPMark	eval-sys (Wu, Liu, et al., 16 authors)	2509.24002 · 2025-09	127 tasks (+ 50 easy) / 5 services (Notion / GitHub / Filesystem / Postgres / Playwright)	"Up in 5 minutes" route · thorough pass@k / pass^k stability metrics · cited in the DeepSeek-V3.2 technical report

Naming pitfall: "MCP-Universe" is both Salesforce's project name and an entire framework (dashboard + agent SDK + benchmark); "MCP-Universe benchmark" refers specifically to the 231-task part. The newer MCP-Universe repo also supports running MCPMark tasks (verbatim: "MCP-Universe now supports evaluating the MCPMark tasks"), so the two already interoperate.

§2 Open-source contribution depth compared

2.1 Detailed table: eval code / harness / Docker / trajectories / RL gym

Dimension	MCP-Universe	MCP-Atlas	Toolathlon	MCPMark
Eval code	✓ Apache-2.0	✓ MIT	✓ no license declared	✓ Apache-2.0
Repo	SalesforceAIResearch/MCP-Universe	scaleapi/mcp-atlas	hkust-nlp/Toolathlon	eval-sys/mcpmark
Reference trajectories	✗ not released	✓ HF dataset includes a TRAJECTORY field (public 500)	✓✓ 17 models × 3 runs × 108 tasks ≈ 5,000+ trajectories / 2 GB	✓ mcpmark-trajectory-log, 2.81 GB
Docker / sandbox	✓ Dockerized MCP servers	✓✓ one-command pull of `ghcr.io/scaleapi/mcp-atlas:1.2.5`	✓✓ separate container per task, podman supported	✓ `./build-docker.sh` · local and Docker modes
Real services vs mock	Real APIs (Notion / GitHub / Maps / Blender / SerpAPI)	Real APIs, but 5 stateful servers ship with fixture dumps	Real APIs, plus local app deployment (poste.io, k8s, canvas)	Real APIs · "isolated environments that do not pollute your accounts/data"
HF trajectories	✗	✓ ScaleAI/MCP-Atlas · 15.6 MB · ~2.8k downloads/month	✓ 2 GB · 17 models	✓ 2.81 GB · MIT
RL gym	✗	✗	✓✓✓ eigent-ai/toolathlon_gym · 503 tasks · fully local PostgreSQL	✗
Public eval service	✗ must run it yourself	✗	✓ "ready-to-use public eval service" 47.253.6.47:8080	✗

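2.2 RL training fit: Toolathlon is the only one with an "official gym"

This is the widest gap among the 4 benches. Toolathlon is the only one whose RL training environment has been split out as its own project: eigent-ai/toolathlon_gym, maintained by CAMEL-AI / Eigent.AI, inherits Toolathlon's task format, evaluation framework, and MCP server interfaces, expands to 503 tasks, and, crucially, replaces every external API with a local PostgreSQL dump (8.2 MB, db/init.sql.gz). From the toolathlon_gym README: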

"Toolathlon-GYM is built on and extends the infrastructure from Toolathlon by HKUST-NLP. The task format, evaluation framework, MCP server interfaces, and database schema design all originate from the Toolathlon project. It runs entirely locally, with no external API calls required at running time."

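For RL training this means:

No quota risk: no OpenAI / Notion / GitHub rate limits blocking rollouts
Determinism: the PostgreSQL dump is versioned, so every reset restores the same state
Parallelizable: each task gets its own ephemeral container, and run_parallel.sh supports N concurrent workers (the official CAMEL-AI example uses 10)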
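The parallelism point can be illustrated with a minimal Python sketch of bounded-concurrency rollouts. This is not the Toolathlon-Gym API: `run_task` is a hypothetical placeholder for "reset the database, run the agent in a fresh container, verify the result", and the default of 10 workers simply mirrors the example value quoted above.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id: str) -> bool:
    """Hypothetical stand-in for one rollout in its own ephemeral container.

    A real harness would restore the versioned PostgreSQL dump, run the agent,
    and call the task's verifier. Here the outcome is a deterministic dummy.
    """
    return len(task_id) % 2 == 0


def run_parallel(task_ids: list[str], workers: int = 10) -> dict[str, bool]:
    """Run rollouts with at most `workers` in flight; results keep task order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(run_task, task_ids))
    return dict(zip(task_ids, outcomes))


if __name__ == "__main__":
    print(run_parallel(["task-a", "task-bb", "task-ccc"], workers=2))
```

Because no external API is involved, the worker count is limited only by local CPU, memory, and container capacity rather than by third-party rate limits.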

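The status of the other three benches:

MCP-Universe: the framework includes a workflow layer and an agent SDK (BasicAgent / ReActAgent / FunctionCallAgent), but there is no RL loop and the environment is not abstracted as a gym.Env. It does have trace collectors (MemoryCollector / SQLiteCollector), which can serve as a trajectory data source.
MCP-Atlas: purely inference-only. The repo contains no "rl" / "train" / "gym" keywords. Scoring uses an LLM-as-judge, so every rollout also has to be graded by Gemini 2.5 Pro, a cost that is impractical for RL training rollouts.
MCPMark: also inference-only. Its emphasis is repeated pass@k evaluation, not RL rollouts. The "isolated environments that do not pollute your accounts/data" are designed for evaluation, not for training resets.

Implication: for MCP tool-use RL training there are realistically two routes: (1) fork a bench and swap in mocks yourself; (2) use Toolathlon-Gym. The former is the route most RL papers take (e.g. #18 AgentWorldModel), ending in a fully synthetic env; Toolathlon-Gym sits between "fully synthetic" and "real API" and is the sweet spot.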

2.3 Self-hosting barrier compared

Bench	Launch command complexity	API keys required	Data fixture upload required	Estimated first-launch time
MCPMark	3 commands (`git clone` · `pip install -e .` · `python -m pipeline`)	Only when running specific services (filesystem tasks run with zero keys)	GitHub tasks automatically download templates from a CDN	5 minutes (official wording: "Quickstart (5 minutes)")
MCP-Atlas	4 commands + at least 8 GB of Docker memory; `docker pull ghcr.io/scaleapi/mcp-atlas:1.2.5`	~18% of tasks run on pure defaults; running everything needs 11 API keys	5 stateful servers: copy the Airtable base; manually import zips for GCal / Notion / Mongo / Slack	Pure-default 20 servers: ~10 minutes; everything enabled: half a day to a day
MCP-Universe	6 steps: clone + venv + pip + libpq + pre-commit + `.env`	OpenAI / Anthropic / Gemini + SerpAPI + Google Maps + GitHub PAT + Notion + path to the Blender binary	Notion root page id must be configured by hand; the Blender v4.4.0 client must be installed	Half a day (the Blender domain involves a desktop GUI)
Toolathlon	"Public eval service" route: a single `eval_client.py` command; self-hosted route: a sequence of bash scripts	OpenAI-compatible base URL + API key; a full run needs tokens for all 32 apps (Canvas / Notion / GCal / Slack / 12306 etc.)	Local containers (poste.io etc.) deployed automatically via `deploy_containers.sh`	Public service: 2 minutes; self-hosted: a day (requires sudo + k8s/podman setup)

Toolathlon's unique advantage is the "public eval service": the team runs a public server, so anyone can submit evaluations by calling an API and skip local setup entirely. From the README:

"We provide Toolathlon evaluation as a service on public servers, where we have setup all the required MCP accounts and you don't need to worry about the setup -- you don't even need to install any MCP-related dependencies, evaluation can be ran by just communicating with our public server"

§3 Community attention (live snapshot, 2026-05-15)

3.1 GitHub stars / forks / issues

Repo	★ Stars	Forks	Open issues	License	Last updated
SalesforceAIResearch/MCP-Universe	585	82	31	Apache-2.0	2026-05-15
eval-sys/mcpmark	417	36	17	Apache-2.0	2026-05-15
hkust-nlp/Toolathlon	356	40	8	not declared	2026-05-15
eigent-ai/toolathlon_gym (RL gym)	124	8	1	Apache-2.0	2026-05-15
scaleapi/mcp-atlas	81	13	20	MIT	2026-05-15

Query command (reproducible):

curl -s "https://api.github.com/repos/<org>/<repo>" \
  | python3 -c "import json,sys; d=json.load(sys.stdin); \
    print(d['stargazers_count'], d['forks_count'], d['open_issues_count'])"
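For scripting the same lookup across all five repos, a pure-Python variant may be easier to maintain than the shell one-liner. The sketch below reads the same three fields from the same endpoint; the function names are illustrative, and unauthenticated requests are subject to GitHub's rate limits. Note that GitHub's `open_issues_count` also counts open pull requests.

```python
import json
import urllib.request

API_URL = "https://api.github.com/repos/{org}/{repo}"


def extract_stats(payload: dict) -> tuple[int, int, int]:
    """Pull (stars, forks, open issues incl. PRs) out of a repo JSON payload."""
    return (
        payload["stargazers_count"],
        payload["forks_count"],
        payload["open_issues_count"],
    )


def fetch_stats(org: str, repo: str) -> tuple[int, int, int]:
    """Fetch live numbers for one repository (needs network access)."""
    url = API_URL.format(org=org, repo=repo)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_stats(json.load(resp))


if __name__ == "__main__":
    for org, repo in [
        ("SalesforceAIResearch", "MCP-Universe"),
        ("eval-sys", "mcpmark"),
        ("hkust-nlp", "Toolathlon"),
        ("eigent-ai", "toolathlon_gym"),
        ("scaleapi", "mcp-atlas"),
    ]:
        print(org, repo, *fetch_stats(org, repo))
```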

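Counterintuitive point: MCP-Atlas has the fewest stars (81) but the greatest influence (adopted by all three frontier labs). This suggests that frontier labs do not pick benchmarks by GitHub stars; what counts is data scale, maintenance quality, and team relationships. The lesson applies directly when you scope a benchmark project of your own.

3.2 Citations in official frontier-model reports

Bench	Claude Opus 4.7 system card (Anthropic)	GPT-5.5 announcement (OpenAI)	Gemini 3.1 Pro Model Card (Google)
MCP-Atlas	✓ cited · 77.3% (Opus 4.7 vs 4.6: +14.6 pt jump)	✓ 75.3% (vs 70.6% for 5.4; relayed via third-party reports)	✓ listed as one of 16 benchmarks · "MCP Atlas: Multi-step workflows using MCP" · 69.2% (Thinking-High)
MCP-Universe	✗	✗	✗
Toolathlon	✗	✗ (the GPT-5.5 announcement returned 403 and could not be fetched; third-party reports mention 55.6%)	✗
MCPMark	✗	✗	✗

What the Anthropic Claude Opus 4.7 announcement says about MCP-Atlas (anthropic.com/news/claude-opus-4-7):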

"MCP-Atlas ... Opus 4.6 score has been updated to reflect revised grading methodology from Scale AI ... the +14.6 point jump on MCP-Atlas is the largest single improvement in the agentic suite"

From the Gemini 3.1 Pro Model Card (deepmind.google/models/model-cards/gemini-3-1-pro/), on the only MCP entry among its 16 benchmarks:

"MCP Atlas — Multi-step workflows using MCP ... 69.2%"

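3.3 Mentions by notable figures / peers

Bench	Mentions
MCPMark	DeepSeek official (12-01): "DeepSeek v3.2 uses MCPMark! Kudos on securing the best open-source model." X post · cited in the DeepSeek V3.2 technical report · Qwen team (09-10) noted on X that qwen-3-coder-plus is the best open-source model
Toolathlon	CAMEL-AI announced the Toolathlon-Gym release on X: "Introducing Toolathlon-GYM: Large-Scale Long-Horizon Environments for Tool-Use Agents" · trajectories for 4 new frontier models (gemini-3-pro / claude-4.5-opus / gpt-5.1 / deepseek-v3.2-thinking) are already on HF
MCP-Atlas	Scale AI officially operates the live leaderboard; Anthropic's system card makes it the sole representative of agentic tool use; HF dataset downloads ~2,800/month
MCP-Universe	Official Salesforce blog · Discord opened · attention comes mainly through the academic paper, with no visible promotion by individual KOLs

3.4 Academic citations and cross-paper references

Honest disclaimer: I could not pull citation counts directly from Google Scholar or Semantic Scholar, so the ordering below is inferred from public information (2026-05) rather than measured:

MCP-Universe (2025-08, the earliest, most academic in character): estimated to be the most cited; included in several surveys
MCPMark (formally cited in the DeepSeek technical report, which adds an "adopted by open-source model peers" boost)
Toolathlon (citations still ramping, driven by the Gym derivative and its attachment to the CAMEL-AI ecosystem)
MCP-Atlas (2026-01, the latest to launch; its citation count is likely below the other three, but frontier citation is a separate dimension)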

§4 Full leaderboards: top-10 tables for the 4 benches

§4.1 MCP-Universe overall (official site, 2026-05 snapshot)

Primary metric: overall success rate (%).

#	Model	Overall	Loc Nav	Repo	Financial	3D	Browser	Web Search
1	Gemini-3-Pro-Preview	44.59	35.56	18.18	82.50	52.63	38.46	41.82
2	GPT-5-Medium	43.72	35.56	30.30	60.00	52.63	43.59	36.36
3	Grok-4.1-Fast	40.69	28.89	15.15	85.00	26.32	33.33	43.64
4	Claude-4.0-Sonnet	32.90	22.22	6.06	77.50	36.84	35.90	21.82
5	Grok-4-Fast	32.47	22.22	6.06	80.00	21.05	23.08	32.73
6	Claude-4.0-Sonnet-Thinking	31.60	24.44	6.06	72.50	47.37	35.90	14.55
7	Claude-4.5-Sonnet	35.06	26.67	12.12	80.00	52.63	28.21	21.82
8	Kimi-K2-Thinking	26.41	20.00	12.12	60.00	15.79	20.51	23.64
9	Claude-4.5-Haiku	26.41	22.22	12.12	60.00	21.05	20.51	20.00
10	GLM-4.6	25.97	15.56	9.09	55.00	31.58	25.64	21.82

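Note: on the main board Claude-4.5-Sonnet sits at rank 7 even though its Overall score (35.06) is higher than those at ranks 4 through 6. The official board sorts by "first-listed best score per provider" and then mixes in additional versions. I reproduce the order shown on the official site.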
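To see what the table looks like under a plain score-based ordering, the Overall column can be re-sorted. The sketch below uses only the model names and Overall values from the table above; the stable sort keeps the site's order for the two models tied at 26.41.

```python
# (model, overall success rate %) in the order shown on the official site
site_order = [
    ("Gemini-3-Pro-Preview", 44.59),
    ("GPT-5-Medium", 43.72),
    ("Grok-4.1-Fast", 40.69),
    ("Claude-4.0-Sonnet", 32.90),
    ("Grok-4-Fast", 32.47),
    ("Claude-4.0-Sonnet-Thinking", 31.60),
    ("Claude-4.5-Sonnet", 35.06),
    ("Kimi-K2-Thinking", 26.41),
    ("Claude-4.5-Haiku", 26.41),
    ("GLM-4.6", 25.97),
]

# Python's sort is stable, so ties keep their original relative order.
by_score = sorted(site_order, key=lambda row: row[1], reverse=True)

if __name__ == "__main__":
    for rank, (model, overall) in enumerate(by_score, start=1):
        print(f"{rank:>2}  {model:<28} {overall:.2f}")
```

Sorted purely by score, Claude-4.5-Sonnet moves from position 7 to position 4; every other model keeps its relative order.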

§4.2 MCP-Atlas overall (live leaderboard, 2026-04-08)

Primary metric: pass rate (%) with 95% CI.

#	Model	Pass Rate	CI
1	Muse Spark (Scale's own)	82.20	±2.30
2	claude-opus-4-7 (max)	79.10	±2.50
3	gemini-3.1-pro-preview (high)	78.20	±2.50
4	claude-opus-4-6 (max)	76.80	±2.70
5	glm-5p1 (Zhipu)	75.60	±2.70
6	gpt-5.5 (xhigh)	75.30	±2.70
7	gpt-5.4 (xhigh)	70.60	±2.80
8	gemini-3-pro-preview	70.30	±2.80
9	claude-opus-4-5 (high)	69.80	±2.90
10	claude-sonnet-4-6	69.50	±2.90

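Note: apart from glm-5p1 (Zhipu), every model in this top 10 is proprietary, and no <40B open-source model appears on Scale's live leaderboard at all. This is Atlas's key weakness (detailed in §5).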
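The CI column can be sanity-checked with a back-of-the-envelope calculation. This note does not record how Scale computes its intervals or over how many tasks, so the sketch below rests on two assumptions of mine: a normal-approximation (Wald) interval and n = 1000 tasks.

```python
import math


def wald_half_width(pass_rate_pct: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI, in percentage points.

    Assumption: independent pass/fail outcomes over n_tasks tasks.
    """
    p = pass_rate_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_tasks)


if __name__ == "__main__":
    # Pass rates taken from the table above; n = 1000 is an assumption.
    for model, rate in [("claude-opus-4-7 (max)", 79.10), ("gpt-5.4 (xhigh)", 70.60)]:
        print(f"{model}: ±{wald_half_width(rate, 1000):.2f}")
```

Under these assumptions the half-widths come out within a few tenths of a point of the listed values, which is roughly what 1,000 pass/fail outcomes would give. Any remaining differences may come from rounding or from a different interval method.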

§4.3 Toolathlon (toolathlon.xyz, fetched 2026-05)

#	Model	Pass@1	Pass@3	Pass^3	Date
1	GPT-5.5-xhigh	55.6	—	—	2026-04-24
2	DeepSeek-V4-Pro Max (open-source)	52.8 ± 1.9	63.9	38.9	2026-04-25
3	Claude-Opus-4.7	52.8	—	—	2026-04-25
4	Kimi-K2.6 (open-source)	50.0	—	—	2026-04-21
5	Gemini-3.1-Pro	48.8 ± 2.3	62.0	34.3	2026-03-13
6	MiniMax-M2.7 (open-source)	46.3	—	—	2026-03-18
7	GLM-5.1 (open-source)	40.7	—	—	2026-04-07
8	Qwen3.6-Plus	39.8	—	—	2026-04-02
9	Grok-4	27.5 ± 1.7	38.9	16.7	2025-10-28

Of the 4 benches, Toolathlon has the best open-source coverage near the top: 4 of the 9 models listed above are open-source (although all are 100B+ MoE, not <40B).

§4.4 MCPMark (mcpmark.ai/leaderboard, fetched 2026-05)

#	Model	Pass@1	Pass@4	Pass^4
1	gpt-5-2-high (gpt-5.2)	57.5	66.9	44.9
2	gemini-3-pro-high	53.9	66.9	37.8
3	gpt-5-medium	52.6	68.5	33.9
4	gpt-5-high	51.6	66.1	33.1
5	gemini-3-pro-low	50.8	67.7	30.7
6	gpt-5-low	46.9	63.0	26.8
7	claude-opus-4-5-high	42.3	53.5	33.9
8	deepseek-v3-2-thinking (open-source)	36.8	51.2	21.3
9	claude-sonnet-4-5	32.1	46.5	16.5
10	grok-4	31.7	44.9	18.1

MCPMark uses pass^k as its stability metric: a task counts only if every one of the k runs passes. The gap between pass@1 and pass^4 reflects consistency.
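The three columns can be computed from per-task run outcomes as in the sketch below. It uses the conventional definitions (pass@k: at least one of k runs passes; pass^k: all k runs pass). Treating pass@1 as the mean single-run success rate is my assumption; MCPMark's exact aggregation is not spelled out in this note. The task ids and outcomes are made up for illustration.

```python
def pass_metrics(runs: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate k pass/fail outcomes per task into pass@1, pass@k and pass^k."""
    n_tasks = len(runs)
    # Mean single-run success rate, averaged over tasks (assumed definition).
    pass_at_1 = sum(sum(r) / len(r) for r in runs.values()) / n_tasks
    # A task counts if any of its k runs passed.
    pass_at_k = sum(any(r) for r in runs.values()) / n_tasks
    # A task counts only if every one of its k runs passed.
    pass_hat_k = sum(all(r) for r in runs.values()) / n_tasks
    return {"pass@1": pass_at_1, "pass@k": pass_at_k, "pass^k": pass_hat_k}


if __name__ == "__main__":
    # Hypothetical outcomes for three tasks with k = 4 runs each.
    example = {
        "task_a": [True, True, True, True],      # stable success
        "task_b": [True, False, False, False],   # flaky: lifts pass@k only
        "task_c": [False, False, False, False],  # stable failure
    }
    print(pass_metrics(example))
```

The flaky task is what opens the gap: it counts fully toward pass@k, partially toward pass@1, and not at all toward pass^k.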

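§5 ⭐ Dedicated leaderboard for open-source <40B models

A perspective the user specifically asked for. The honest answer: across the 4 benches, open-source models under 40B appear very rarely, which is itself a finding.

There are three reasons: (1) the long horizons and tool counts (220-604 tools) of MCP tasks are brutal for small models; (2) every bench's leaderboard is filled by labs running large models, and the bench most inclined to report smaller models is MCP-Universe (because it is the most academic), even though §5.1 shows its public board still has no <40B rows; (3) anyone actually training a <40B model goes to Toolathlon-Gym or a synthetic env (#18 AgentWorldModel) and does not report to the official main-bench leaderboard.

§5.1 MCP-Universe ≤ 40B open-source (taken directly from the official leaderboard)

Filtering the leaderboard table by the open-source tag and estimated parameter count:

Model	Parameters	Overall	Note
Note: the public MCP-Universe leaderboard is almost entirely 100B+ MoE. The official board provides nothing for the <40B region, so the table above has no rows. The BFCLv3 OOD numbers that contemporaneous work (e.g. #18 AgentWorldModel) reports on Qwen3-thinking 4B/8B/14B are a more useful reference.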

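§5.2 MCP-Atlas ≤ 40B open-source

Zero. Scale's live leaderboard (2026-04-08 snapshot) consists entirely of proprietary models plus GLM-5p1 (Zhipu, a large model). No Qwen3 / Llama / GPT-OSS / Gemma model has made the official board.

The HF dataset README also makes clear that MCP-Atlas tasks are unfriendly to small models (10-25 tools are enabled at once, which overflows the context of models with <8K context).

§5.3 Toolathlon ≤ 40B open-source

Zero <40B models on the main board. The smallest-tier models in the Toolathlon-Trajectories HF dataset are gpt-5-mini and claude-4.5-haiku-1001; both are closed-source, so neither meets the <40B open-source bar.
Toolathlon-Gym (the 503-task training version), however, is the only environment that explicitly recommends small-model training. The CAMEL-AI README uses gemini-3-flash-preview as its example, but the environment is essentially an RL training track: any <40B model can plug in; scores just have not been published yet.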

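§5.4 MCPMark ≤ 40B open-source

Model	Parameters	Pass@1
gpt-oss-120b	120B (above the <40B cutoff)	4.7%
Ranked 36th with 4.7% pass@1. The maintainers added "50 easy tasks" specifically for small open-source models (the 11-17 PR), but official scores for <40B models have still not been published.

What the MCPMark README says about this: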

"17 Nov — Added 50 easy tasks (10 per MCP server) for smaller open-source models"

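§5.5 Summary: <40B is the "desert zone" of MCP benchmarks

Key fact: as of 2026-05-15, no <40B open-source model tops or enters the top 10 of any of the 4 official MCP benchmark leaderboards. This contrasts sharply with GUI agents (#17 UI-TARS-2 230B / #14 ClawGUI 7B) and with the trend in RL training (#06 AgentGym-RL 8B / #18 AgentWorldModel 4/8/14B).

Implication: to get your 4B/8B model "on the board" of an MCP bench, the only viable path today is to train it with Toolathlon-Gym and then report the main Toolathlon score. Atlas / Universe / MCPMark have no dedicated open-source <40B ranking, so a submission there would sink below rank 30.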

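§6 Submission process in detail: how to get results "on the board"

This is where the 4 benches differ most.

[Figure: submission paths of the 4 benches compared.] Toolathlon has the most complete process (a public eval service plus an email fallback). Neither MCP-Atlas nor MCP-Universe has a public self-submit mechanism; in practice new model scores reach their leaderboards through "frontier teams contacting the maintainers directly" or "repo maintainers running the evaluation themselves".

§6.1 MCPMark: only the task contribution process is public; model scores arrive via PRs

Official contribution doc, verbatim: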

1. "Fork the repository and create a feature branch."
2. "Add new tasks under tasks/<mcp>/<task_suite>/<category>/<task_id>/ with the files of meta.json, description.md and verify.py."
3. "Ensure all tests pass."
4. "Submit a pull request — contributions are welcome!"
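Step 2 of the quoted list fixes a directory layout and three file names, which is enough to scaffold a new task folder. The sketch below is a hypothetical helper, not part of MCPMark: the contents expected inside `meta.json` and the interface expected of `verify.py` are not described in this note, so it only writes empty placeholders, and the example path segments are made up.

```python
from pathlib import Path

# File names quoted from the contribution doc; the contents are placeholders.
PLACEHOLDERS = {
    "meta.json": "{}\n",
    "description.md": "TODO: describe the task\n",
    "verify.py": "# TODO: verification logic\n",
}


def scaffold_task(root: Path, mcp: str, task_suite: str, category: str, task_id: str) -> Path:
    """Create tasks/<mcp>/<task_suite>/<category>/<task_id>/ with the 3 files."""
    task_dir = root / "tasks" / mcp / task_suite / category / task_id
    task_dir.mkdir(parents=True, exist_ok=True)
    for name, content in PLACEHOLDERS.items():
        path = task_dir / name
        if not path.exists():  # never overwrite files that already exist
            path.write_text(content, encoding="utf-8")
    return task_dir


if __name__ == "__main__":
    # All four path segments below are hypothetical examples.
    print(scaffold_task(Path("."), "filesystem", "standard", "demo_category", "demo_task"))
```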

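Key gap: the official docs do not say how to submit evaluation results for a new model. The README News section shows that the team adds new model scores itself ("02 Dec — Evaluated gemini-3-pro-preview..."); a third party who wants onto the board can only contact the maintainers through a GitHub issue or Discord and ask them to run the evaluation.

§6.2 Toolathlon: the most complete process, with a "run it for you" option

The four ways to run an evaluation listed in the README, verbatim: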

"Basically you have four ways of running Toolathlon evaluation:
1. Using our public evaluation service: Check EVAL_SERVICE_README.md for more details.
2. Setup your own Toolathlon evaluation service on your own machine as detailed below.
3. If you are a major user that will use Toolathlon evaluation a lot, you can also contact us (jlini@cse.ust.hk / junxianh@cse.ust.hk), we may be able to provide a dedicated evaluation service for you (for free).
4. If you have an API endpoint and just want to test your model, you can contact us ... and we are happy to help you run evaluation on Toolathlon with your given API endpoint."

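Of the 4 benches, this is the only one that explicitly says "just give us an endpoint and we will run it for you".

§6.3 MCP-Atlas: no submission interface exposed at all

Neither the repo README nor the paper gives self-submit instructions. Every model on the live leaderboard was evaluated by Scale AI itself. This means:

A new model can only get on the board by waiting for Scale AI to evaluate it;
frontier models on the board (Claude Opus 4.7 / Gemini 3.1 Pro / GPT-5.5) appear to be there through commercial relationships rather than community submission;
this is consistent with MCP-Atlas's strong position in system cards: gatekeeping for the bench is concentrated at Scale.

§6.4 MCP-Universe: eval code fully open, but the leaderboard is maintained behind closed doors

The last section before "Citation" in the README is "Visualize the agent running information"; leaderboard submission is not mentioned at all. The results page on the official site is updated by hand by the Salesforce team. The community can only file requests on Discord (link included in the README) or via GitHub issues.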

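§7 Selection decision tree

Scenario	First choice	Rationale
Want to be cited by Anthropic/OpenAI/Google	MCP-Atlas	All three frontier labs report it
Training a <40B open-source tool-use model	Toolathlon-Gym → Toolathlon main bench	Only bench with an RL gym, and 4 open-source models near the top of the main bench
Writing a general MCP baseline for a paper	MCP-Universe	6 domains · most academic in character · comparable against many papers
CI / internal regression testing	MCPMark	Up in 5 minutes · filesystem tasks have zero dependencies · 50-task easy suite
Need a stability report	MCPMark (pass^k)	Reports pass^4 consistency for every listed model (Toolathlon reports Pass^3 for only some entries)
Need long-horizon multi-app orchestration	Toolathlon	32 apps · 604 tools · ~20 turns on average, the largest app and tool counts of the four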

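§8 Overall take · de facto standing in 2026

Core take: the 4 benches are not mutually exclusive; each occupies its own niche. Atlas is the "yardstick", Toolathlon the "training ground", MCPMark the "quick regression test", and MCP-Universe the "academic baseline".

MCP-Atlas: the de facto "standard yardstick", but with closed-door gatekeeping

The most important MCP benchmark of 2026, because all three frontier model cards report it. But there is no self-submit mechanism, and Scale AI controls what enters and leaves the leaderboard. Academia and the open-source community therefore have little say, which is a potential fault line for Atlas. (See my hypothesis in §8 of the #19 deep read.)

Toolathlon: the only MCP bench with an RL gym, best suited to training-driven work

The 108 tasks of the main Toolathlon test are for evaluation; the genuinely scarce resource is Toolathlon-Gym's 503 tasks plus local PostgreSQL. This combination frees RL training rollouts from API quotas and nondeterminism, and makes it the only one of the 4 benches that can plug directly into a GRPO / DPO training pipeline. If #18 AgentWorldModel represents the "fully synthetic RL env" route, Toolathlon-Gym represents the "real protocol + local data" route; semi-synthetic is the sweet spot.

The HKUST-NLP team's "contact us, we are happy to help" policy is also the most open of the 4 benches. Community momentum is building.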

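MCPMark: the pragmatic first choice for "quick regression testing"

5-minute startup + 127 tasks + thorough pass^k reporting: the highest engineering friendliness of the four. The DeepSeek V3.2 technical report citation gives it an "adopted by open-source model peers" label, which is what differentiates it from Atlas. But with no RL gym and no frontier model card citation, its long-term position is likely to be "practical companion" rather than "primary benchmark".

MCP-Universe: an academic-style general baseline, but Atlas took the "frontier citation" slot

It is smaller than Atlas (231 vs 1,000 tasks) and Toolathlon (11 servers vs 32 apps), but its domain coverage most resembles a traditional benchmark (6 domains: location nav / repo / financial / 3D / browser / web search), and it ships a complete agent SDK and dashboard, which suits multi-domain ablations in papers. The newer version supports running MCPMark tasks (README: "MCP-Universe now supports evaluating the MCPMark tasks"). That is a good thing, but it also hints that the project is shifting toward a "meta framework" and no longer insists on being "the benchmark".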

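My predictions for the next 6-12 months

Atlas will keep being cited by frontier labs, but Scale AI will need to open some form of self-submit in the second half of 2026 (for example a "community-experiments" repo model); otherwise academia will gradually shift to Toolathlon.
Toolathlon-Gym will take off once the first paper trained with it appears (expected 2026 Q3). It is currently the most undervalued RL training asset.
MCPMark, thanks to the DeepSeek citation and the Qwen team mention, will be widely used as a headline result by Chinese open-source labs; frontier labs remain unlikely to adopt it.
MCP-Universe may gradually become a "framework provider" rather than a "benchmark". Its framework value (MCP+ context mgmt, W&D research agent) will outweigh the benchmark itself.

Concrete advice for you (the user): since you have already done a deep read of #19 MCP-Atlas, the next step should be a deep read of the main Toolathlon paper (2510.25726) plus the Toolathlon-Gym engineering docs. Together they are the best starting point in 2026 for training an MCP tool-use agent; Atlas and Universe can serve purely as evaluation yardsticks, with no need to invest in implementation.

Data snapshot: 2026-05-15 · GitHub API · arXiv · official leaderboards · Anthropic Opus 4.7 launch post · Google DeepMind Gemini 3.1 Pro model card · third-party GPT-5.5 reports (the OpenAI site returned 403 and could not be fetched)
This is a survey note, not a deep read; for the in-depth reading of MCP-Atlas see #19.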