MCP-Universe: Benchmarking LLMs with Real-World MCP Servers

Salesforce AI Research · Ziyang Luo · Zhiqi Shen · Wenzhuo Yang · Zirui Zhao · Prathyusha Jwalapuram · Amrita Saha · Doyen Sahoo · Silvio Savarese · Caiming Xiong · Junnan Li · 2025-08-20 (v1)
arXiv:2508.14704 · GitHub: SalesforceAIResearch/MCP-Universe · Website + Leaderboard · Salesforce Blog · MCP+ project page
Keywords: MCP benchmark · agent framework · GRPO RL training · verl · Deep Research Agent · Wide&Deep · MCP+ context compression · Salesforce

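Quick-read card (TL;DR)

In one sentence: MCP-Universe is Salesforce AI Research's full-stack experimental platform for MCP agents, bundling a benchmark, an RL training framework, and two independent subproducts (Deep Research Agent and MCP+). The benchmark covers 11 MCP servers / 6 domains / 231 tasks / 133 tools, and even the SOTA tops out at 43.72% for GPT-5, followed by Grok-4 at 33.33% and Claude-4.0-Sonnet at 29.44%. The real differentiator is not the benchmark numbers but the mcpuniverse/rl/ subpackage in the repo: a complete GRPO + verl integration, Hybrid and Fully-Async training modes, TITO (Token-In-Token-Out) rollout, and three transports (stdio, sse, docker_pool). It is the only ready-to-use RL training stack I have seen among MCP papers so far. The repo also embeds (a) Deep Research Agent (Wide&Deep; GPT-5-medium reaches 62.2% on BrowseComp), (b) MCP+ (the mcp-build-plus CLI, which adds a -plus wrapper to an existing mcp.json for a claimed 50–75% token reduction), and (c) an MCPMark task runner. Together these turn an evaluation platform into an "MCP meta-platform".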

11 / 6 / 231

MCP server / domain / task

43.72%

GPT-5 success rate (SOTA)

GRPO+verl

A real RL training stack: Hybrid + Fully-Async

585★ Apache-2.0

Open-source completeness ★★★★★ (2026-05)

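Position: among the four MCP benchmarks (cf. #21), MCP-Universe is the only one that positions itself as a "framework" rather than a "benchmark". The GitHub description, verbatim: "MCP-Universe is a comprehensive framework designed for RL training, benchmarking, and developing AI agents for general tool-use." It and #19 MCP-Atlas (Scale AI's pure eval with a frontier judge) take two different routes: Atlas competes on citations in frontier model cards, while Universe competes on "I give you the RL infra too". The cost is that it has not made it into any mainstream frontier model's system card.

1 · Background: MCP-Universe's "framework" ambition

1.1 It is not in the same class as the benchmarks covered in #19/#21

Before reading the paper, note that the description Salesforce puts on GitHub does not say "benchmark":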

"MCP-Universe is a comprehensive framework designed for RL training, benchmarking, and developing AI agents for general tool-use."
(GitHub repo description, verified 2026-05)

This sentence is clearly out of step with the paper title "Benchmarking Large Language Models with Real-World Model Context Protocol Servers". The paper was written in 2025-08 and is primarily a benchmark paper, but in the nearly one year since publication Salesforce has kept adding to the repo:

2026-02-11: added the Wide&Deep Deep Research Agent (separate arXiv paper)
Around 2026-03: added MCP+ (separate project site mcp-plus.github.io)
2026-04: added the MCPMark task runner, which runs someone else's benchmark inside its own repo
Ongoing: the whole mcpuniverse/rl/ subpackage for GRPO+verl training

So the paper is a snapshot of the 2025-08 version, while the repo has grown into a "full-stack MCP agent platform" that blends four things: benchmark / research agent / context compression / RL trainer.

1.2 Where it sits in the MCP ecosystem

Redrawing the four-way grid from #21, with a framework dimension added:

Project	Team	Positioning	RL training stack?	Frontier citation
MCP-Universe	Salesforce AI Research	framework	✓ GRPO+verl	✗
MCP-Atlas (#19)	Scale AI + NUS	pure eval bench	✗	✓ Claude 4.7 card
Toolathlon / Toolathlon-Gym	—	eval + RL gym	✓ Gym API	✗
MCPMark	—	pure eval bench	✗	✗

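Key takeaways:

MCP-Universe and Toolathlon-Gym are the only two that can actually run RL training.
MCP-Universe's RL stack is more deeply integrated, though: it uses verl as the backend directly, with Hybrid / Fully-Async modes plus the TITO optimization, whereas Toolathlon-Gym only exposes a gym interface and leaves the trainer to you.
The cost: zero frontier model card citations. The Anthropic Claude Opus 4.7 system card cites MCP-Atlas (77.3%), not MCP-Universe.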

2 · 11 servers / 6 domains / 231 tasks in detail

2.1 The 6 domains and their MCP servers

Domain	Primary MCP server	Tasks	Share
Web Searching	Google Search MCP + Fetch MCP	55	23.8%
Location Navigation	Google Maps MCP	45	19.5%
Financial Analysis	Yahoo Finance (yfinance) MCP	40	17.3%
Browser Automation	Playwright MCP	39	16.9%
Repository Management	GitHub MCP	33	14.3%
3D Designing	Blender MCP (Salesforce-modified)	19	8.2%
Total		231	100%

The other 4 of the 11 servers are auxiliary servers shared across domains: Notion / Weather / Date / Calculator. The paper reports 133 tools in total (the callable functions that all servers together expose to the agent).

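2.2 Task horizon / step counts

The paper notes that "the number of tokens increases rapidly as the number of interaction steps grows". For successfully completed tasks, the average number of steps ranges from 4.82 to 8.22 depending on the model (GPT-5 tends to take more steps; smaller models take fewer on simple tasks). Failed tasks often hit the max_iterations ceiling.

2.3 What a task config file looks like

A real task JSON (from the README, simplified):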

{
  "category": "google-maps",
  "question": "Plan a 3-day trip from SF to LA via coastal route ...",
  "mcp_servers": [{"name": "google-maps"}, {"name": "weather"}],
  "output_format": {"itinerary": "[LIST]"},
  "evaluators": [
    {"func": "json -> get(itinerary) -> len", "op": ">=", "value": 3},
    {"func": "json", "op": "google_maps.check_route_valid", "op_args": {...}}
  ]
}

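The evaluators form a declarative DSL: func is the value-extraction rule (a chain such as json -> get(x) -> foreach -> get(y) -> len), op is a comparison operator or a custom check function, and value is the ground truth. Adding a new task therefore needs no Python, only JSON, as long as the existing check functions cover what the task has to verify.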
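To make the chain semantics concrete, here is a minimal sketch of how such a func string could be interpreted. It is not the repo's evaluator code: the step names come from the example above, while `apply_step`, `extract`, `run_evaluator`, the comparator table, and the reading of `foreach` as "apply the remaining steps to every element" are all assumptions made for illustration.

```python
import json
import operator
import re

# Assumed comparator set; the real DSL may support more (or differently named) ops.
COMPARATORS = {">=": operator.ge, "<=": operator.le, "=": operator.eq,
               ">": operator.gt, "<": operator.lt}

def apply_step(step, value):
    """Apply a single extraction step to one value."""
    if step == "json":
        return json.loads(value) if isinstance(value, str) else value
    if step == "len":
        return len(value)
    match = re.fullmatch(r"get\((\w+)\)", step)
    if match:
        return value[match.group(1)]
    raise ValueError(f"unknown step: {step}")

def extract(steps, value):
    """Run the chain left to right; 'foreach' maps the rest over each element."""
    if not steps:
        return value
    head, rest = steps[0], steps[1:]
    if head == "foreach":
        return [extract(rest, item) for item in value]
    return extract(rest, apply_step(head, value))

def run_evaluator(evaluator, raw_output):
    """Extract a value with 'func', then compare it to 'value' using 'op'."""
    steps = [s.strip() for s in evaluator["func"].split("->")]
    actual = extract(steps, raw_output)
    return COMPARATORS[evaluator["op"]](actual, evaluator["value"])

answer = '{"itinerary": ["Day 1: SF to Monterey", "Day 2: Big Sur", "Day 3: LA"]}'
check = {"func": "json -> get(itinerary) -> len", "op": ">=", "value": 3}
passed = run_evaluator(check, answer)
print(passed)
```

Custom checks such as google_maps.check_route_valid would need a registry of Python callables on top of this, which is why "JSON only" holds just for tasks whose checks already exist.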

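3 · Evaluation pipeline: three layers, Format / Static / Dynamic

This is the biggest methodological divergence between MCP-Universe and MCP-Atlas.

The paper defines 84 evaluator instances in total, distributed by type as follows:

Type	Share	Purpose	Example
Format Evaluator	4.8%	Output schema / JSON compliance	`output must contain an itinerary field that is a list`
Static Evaluator	38.1%	Time-invariant fact checks	Fixed answers such as "what are the capitals of the 50 US states"
Dynamic Evaluator	57.1%	Fetches ground truth from the real API at eval time	"Tesla's closing price today" → yfinance is called again at eval time to get the current answer

Figure 1: The MCP-Universe evaluation pipeline. Agent → multi-step interaction with real MCP servers → final JSON answer → the three evaluator types run in sequence → a task scores 1 only if all of them pass. The 57% Dynamic share is the most visible difference from MCP-Atlas (LLM judge) and MCPMark (static answers).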
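A short arithmetic check ties the shares above back to the total of 84. The integer counts below are inferred by rounding (the notes give only percentages), so treat them as an assumption; `task_score` is a hypothetical helper that mirrors the all-pass rule from the figure caption.

```python
TOTAL_EVALUATORS = 84
reported_share = {"format": 4.8, "static": 38.1, "dynamic": 57.1}

# Inferred counts: an assumption, since only percentages are reported.
counts = {name: round(TOTAL_EVALUATORS * pct / 100)
          for name, pct in reported_share.items()}

# Recompute the shares from the inferred counts to confirm they round back.
recomputed = {name: round(100 * n / TOTAL_EVALUATORS, 1)
              for name, n in counts.items()}

def task_score(evaluator_results):
    """All-pass rule: a task scores 1 only if every evaluator passed."""
    return int(all(evaluator_results))

print(counts, recomputed)
print(task_score([True, True, False]))
```

Under this reading the split is 4 format, 32 static, and 48 dynamic evaluators, and a single failing evaluator zeroes the task.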

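3.1 Why it explicitly rejects LLM-as-judge

"The framework explicitly avoids LLM-as-a-judge approaches due to static knowledge limitations and style bias concerns."

This is MCP-Universe's direct methodological challenge to MCP-Atlas. Atlas uses Gemini-2.5-Pro as a claims-based judge: the signal is clean but slow and expensive, and it carries a risk of training-data leakage. Universe instead chooses "DSL + live checks against the real API". The cost is a complex evaluator config for every task, but the reward signal does not drift with the judge model's version, which makes it well suited as an RL reward function.

This choice directly paves the way for the RL training stack in §4.4, because RL cannot tolerate an unstable reward.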

4 · 🔍 Open-source status: an on-site repo inventory

This section is the core of these notes. Everything here was pulled from the GitHub API on 2026-05-17, not copied from the paper.

Repo metadata: Apache-2.0 · 585 stars / 82 forks · 25,804-byte README · Python · last push 2026-04-20
Homepage: mcp-universe.github.io · Created: 2025-05-02 · Status: actively maintained

4.1 Top-level repository layout

Path	Type	Purpose
`mcpuniverse/`	dir	Main Python package (broken down in §4.1.b)
`tests/benchmark/`	dir	One `test_benchmark_*.py` entry point per domain (6 domains)
`docker/`	dir	Dockerfile / docker-compose
`docs/`	dir	Guides for Blender setup, Python sandbox setup, custom evaluators, etc.
`third_party/`	dir	External dependencies (git submodules)
`assets/`	dir	Images + icons
`blender_addon.py`	file (64 KB)	Salesforce's own modified Blender add-on
`setup_blender_and_vnc.sh`	file (13 KB)	One-step headless Blender + VNC remote rendering
`.env.example`	file (2 KB)	API key template
`requirements.txt` + `dev-requirements.txt`	file	Dependencies
`pyproject.toml` · `Makefile` · `.pylintrc`	file	Build + lint

4.1.b · One level down inside the `mcpuniverse/` main package

Subpackage	Purpose
`agent/`	Agent implementations: `basic.py` · `react.py` · `function_call.py` · `function_call_wide.py` (Wide&Deep) · `harmony_agent.py` (GPT-OSS) · `claude_code.py` · `react_train_agent.py` (training only) · `explore_and_exploit.py` · `reflection.py` · `openai_agent_sdk.py`
`benchmark/`	Benchmark runner + 4 config sets: `configs/dummy` / `configs/mcpuniverse` (the main 231 tasks) / `configs/mcpmark` / `configs/deepresearch`
`evaluator/`	Evaluator implementations: `blender` · `github` · `google_maps` · `google_search` · `notion` · `playwright` · `weather` · `yfinance` · `deepresearch` · `mcpmark` · `evaluator.py` · `functions.py`
`extensions/`	Currently only `mcpplus/` (see §4.3)
`llm/`	Multi-provider adapters: openai · claude · gemini · deepseek · grok · mistral · ollama · openrouter · local_llm + internal gateways (`sf_research_gateway` / `sf_llm_express_gateway`) + a `tito/` subpackage (token-in-token-out optimization)
`mcp/`	MCP client/server management: `client.py` · `gateway.py` · `manager.py` · `env_pool/` (Docker container pool) · `servers/` · `permission.py`
`pipeline/`	Orchestration workflows
`rl/`	RL training stack, see §4.4 (the most important part of this section)
`workflows/`	Multi-agent orchestration: chain / router / orchestrator
`dashboard/`	Gradio visualization UI
`app/`	FastAPI web backend
`tracer/` + `callbacks/`	Trajectory logging: MemoryCollector / FileCollector / SQLiteCollector
`common/`	Utility functions

A layout with about 15 top-level subpackages (the table above lists 14, counting `tracer/` and `callbacks/` separately) is, among all the MCP repos I have read, second only to (or tied with) Toolathlon-Gym in size. The MCP-Atlas repo is roughly 1/5 as large.

4.2 Deep Research Agent (Wide&Deep, W&D)

This is an independent subproject, added to the repo on 2026-02-11, with its own arXiv paper (2602.07359) and its own website, xqlin98.github.io/wide-deep-research-agent/.

Core idea: rather than digging along a single thread (deep), the agent issues multiple tool calls in parallel each round to expand "width" (wide).
Implementation: mcpuniverse/agent/function_call_wide.py + function_call_wide_claude.py.
Results: W&D + GPT-5-medium reaches 62.2% on BrowseComp, beating the 54.9% that GPT-5-high gets in its own deep research mode, while also reducing turns, API cost, and wall-clock time.
Supported datasets: BrowseComp / GAIA / HLE, each with three ready-made yaml configs for GPT-5 / Gemini-3-Pro / Claude-4.5-Sonnet (under mcpuniverse/benchmark/configs/deepresearch/configs/<dataset>/).
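The "wide" idea can be illustrated with a small asyncio sketch that contrasts one concurrent round of tool calls with a strictly sequential loop. This is not the code in function_call_wide.py: `fake_search`, `wide_round`, and `deep_rounds` are hypothetical names, and the tool is a local stand-in that makes no network call.

```python
import asyncio

async def fake_search(query):
    """Stand-in for a real MCP tool call (hypothetical, no network access)."""
    await asyncio.sleep(0.01)
    return f"results for {query!r}"

async def wide_round(queries):
    """Wide: issue every tool call of one round concurrently, keeping input order."""
    return await asyncio.gather(*(fake_search(q) for q in queries))

async def deep_rounds(queries):
    """Deep baseline: one tool call per round, strictly sequential."""
    return [await fake_search(q) for q in queries]

queries = ["BrowseComp paper", "GAIA benchmark", "HLE dataset"]
wide_results = asyncio.run(wide_round(queries))
deep_results = asyncio.run(deep_rounds(queries))
print(wide_results == deep_results)
```

Both variants return the same observations; the wide round simply collects them in one turn, which is consistent with the reported reductions in turn count and wall-clock time.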

Practical value: if you need a deep research agent baseline, reuse the W&D configs directly; the three frontier-model setups are already hard-coded. The tools are serper-search + jina-scrape-llm-summary + python-code-sandbox (the last one is used for HLE).

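4.3 MCP+: a standalone CLI subproject that adds a -plus wrapper to an existing mcp.json

MCP+ is an extension inside the repo, but it has its own project page, mcp-plus.github.io, and a standalone CLI entry point, mcp-build-plus. Installing the mcpuniverse package is enough to use it.

The problem it addresses: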

"MCP tools often return large, verbose outputs (web pages, API responses, file contents). Sending these directly to your LLM wastes context and money. MCP+ wraps your MCP clients with intelligent post-processing that extracts only the relevant information before it reaches your LLM." — README

Usage (copied from mcpuniverse's own README):

# Install the package
pip install mcpuniverse

# Add the -plus suffix to an existing Cursor / Claude Code mcp.json in one step
mcp-build-plus --mcp-config ~/.cursor/mcp.json

# Or wrap only specific servers
mcp-build-plus --mcp-config ~/.cursor/mcp.json --servers github playwright

# Tune the token threshold (post-processing triggers only above this length)
mcp-build-plus --mcp-config ~/.cursor/mcp.json --token-threshold 2000

# Use a cheaper model for post-processing
mcp-build-plus --mcp-config ~/.cursor/mcp.json --llm-model gpt-5-mini

# Gemini / Anthropic backends are both supported
mcp-build-plus --mcp-config ~/.cursor/mcp.json \
    --llm-provider gemini --llm-model gemini-2.5-flash \
    --llm-api-key-env GOOGLE_API_KEY

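It generates a new mcp.json in which each server is replaced by a server-plus wrapped version; restart Cursor and it is active. It claims 50–75% token savings with zero code changes. This is a very product-like feature, and Salesforce evidently treats it as something that can be offered independently of the benchmark.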
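The threshold behaviour described above (post-process only outputs longer than --token-threshold) can be sketched as a simple gate. This is an illustration of the idea, not MCP+'s implementation: `approx_token_count`, `post_process`, and `keep_first_words` are hypothetical, the whitespace split is a crude stand-in for a tokenizer, and truncation stands in for the cheaper LLM call.

```python
def approx_token_count(text):
    """Crude whitespace-based estimate; a real wrapper would use a tokenizer."""
    return len(text.split())

def post_process(tool_output, token_threshold, summarize):
    """Pass short outputs through unchanged; send long ones to a summarizer."""
    if approx_token_count(tool_output) <= token_threshold:
        return tool_output
    return summarize(tool_output)

def keep_first_words(text, limit=50):
    """Placeholder for the post-processing LLM: keep only the first words."""
    return " ".join(text.split()[:limit])

short_output = "status: ok"
long_output = "word " * 3000

unchanged = post_process(short_output, 2000, keep_first_words)
compressed = post_process(long_output, 2000, keep_first_words)
print(approx_token_count(unchanged), approx_token_count(compressed))
```

The gate is what keeps the extra LLM call from costing more than it saves: short tool results never pay for post-processing.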

4.4 ⭐ RL training stack: it really exists (verbatim evidence)

This is the biggest difference from MCP-Atlas / MCPMark. The first paragraph of mcpuniverse/rl/README.md (verbatim):

"Reinforcement learning for LLM agents that interact with tools via the Model Context Protocol (MCP). The agent acts inside real tool environments. Each trajectory is a multi-turn conversation where the model makes tool calls, receives results, and continues reasoning until it completes the task or hits the iteration limit. Rewards are computed by task-specific evaluators."
(mcpuniverse/rl/README.md, 7,420 bytes)

The more important verbatim excerpt, on the algorithm:

"The training algorithm is GRPO (Group Relative Policy Optimization): multiple trajectories are generated per prompt, and advantages are normalized within each group. No learned critic or reward model is needed."

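The group-relative normalization in that quote can be written in a few lines. This is a minimal sketch of the advantage computation only, not verl's or the repo's code; `group_relative_advantages` and `eps` are illustrative names, and the population standard deviation is chosen here for simplicity (implementations may use the sample standard deviation instead).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's trajectories within their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four trajectories for the same prompt, each scored 0 or 1 by the evaluators.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every trajectory gets the same reward, the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Because the baseline is the group mean, no critic network is needed; the price is that prompts where all rollouts succeed or all fail contribute zero advantage.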
4.4.a · Key components of the training stack

Component	File	Purpose
MCPLoopManager	`rl/integrations/verl/mcp_loop_manager.py`	Multi-turn tool-calling loop for batched parallel rollouts. Supports Text mode (HTTP API to vLLM/sglang) and TITO mode (token ids sent straight to vLLM, skipping tokenize/detokenize)
MCPRewardManager	`rl/integrations/verl/mcp_reward_manager.py`	Computes the reward with the §3 evaluators (not a learned reward model)
MCP Transport	`mcp/env_pool/`	Three modes: `stdio` (a new process per trajectory) / `sse` (shared gateway) / `docker_pool` (one Docker container per trajectory, fully isolated)
Agent format	`rl/formatters/{gpt_oss,qwen3}.py`	One prompt/parser set each for GPT-OSS HarmonyReAct and Qwen3 ReAct

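4.4.b · Two training modes

Dimension	Hybrid	Fully Async
GPU pool	Actor + Rollout + Ref (reference policy) share the same GPUs; GRPO uses no critic	Rollouter and Trainer use separate GPU pools
Execution	Synchronous; rollout and training alternate	Parallel; the Rollouter produces trajectories continuously
Minimum GPUs	1	2 (8 recommended = 4+4)
GPU utilization	Lower	Higher
Weight sync	—	Every N steps the Trainer pushes weights to the Rollouter via NCCL broadcast, and the vLLM engine hot-reloads them
Entry point	`rl.integrations.verl.hybrid.mcp_main_ppo`	`rl.integrations.verl.fully_async.mcp_async_main`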

4.4.c · Which base models, and what gets trained

The names of the three notebooks under rl/examples/ give it away:

vllm_harmony_react.ipynb: GPT-OSS family (Harmony format = OpenAI gpt-oss-20b/120b)
vllm_react_train.ipynb: generic ReAct (Qwen3 etc.)
vllm_tito_react_train.ipynb: Qwen3 + TITO optimization

Training data format (verbatim from README):

{
  "instance_id": "task_001",
  "instruction": "Calculate the final value if I invested $25,000 in MSFT...",
  "output_format": {"total value": "[NUMBER]"},
  "mcp_servers": [{"name": "yfinance"}, {"name": "calculator"}],
  "dockerfile_path": "/path/to/Dockerfile",
  "evaluators": [{
    "func": "json",
    "op": "yfinance.check_portfolio_task_output",
    "op_args": {"tickers": ["MSFT"], "start_date": "2023-01-09"},
    "desc": "Check whether the final portfolio value is correct."
  }]
}

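Note that the evaluators field uses exactly the same schema as the evaluators in the §2.3 benchmark task. Benchmark tasks can therefore be reused as RL training tasks once the surrounding fields are remapped (for example question vs. instruction), and an RL-trained agent can run directly on the benchmark. This is framework-grade design.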
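The remapping can be sketched as a small conversion function. The field correspondence below is inferred only from the two JSON examples in these notes (question → instruction, plus the added instance_id and dockerfile_path); `benchmark_task_to_rl_record` is a hypothetical helper, and the real converter, if one exists in the repo, may differ.

```python
def benchmark_task_to_rl_record(task, instance_id, dockerfile_path):
    """Remap one benchmark-style task dict into the RL training layout."""
    return {
        "instance_id": instance_id,
        "instruction": task["question"],        # assumed rename
        "output_format": task["output_format"],
        "mcp_servers": task["mcp_servers"],
        "dockerfile_path": dockerfile_path,     # absent from the benchmark task
        "evaluators": task["evaluators"],       # reused unchanged
    }

benchmark_task = {
    "category": "google-maps",
    "question": "Plan a 3-day trip from SF to LA via coastal route",
    "mcp_servers": [{"name": "google-maps"}, {"name": "weather"}],
    "output_format": {"itinerary": "[LIST]"},
    "evaluators": [
        {"func": "json -> get(itinerary) -> len", "op": ">=", "value": 3},
    ],
}

record = benchmark_task_to_rl_record(benchmark_task, "task_001", "/path/to/Dockerfile")
print(sorted(record))
```

Only the wrapper fields change; the evaluators list is passed through as-is, which is the point of the shared schema.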

4.4.d · Start training with one command

# Hybrid mode (runs on a single GPU)
python -m mcpuniverse.rl.integrations.verl.hybrid.mcp_main_ppo \
    --config-path=integrations/verl/config \
    --config-name=mcp_harmony_tito_example \
    actor_rollout_ref.model.path=/path/to/model \
    data.train_files=/path/to/train.json \
    data.val_files=/path/to/val.json

# Fully Async (4+4 GPU)
python -m mcpuniverse.rl.integrations.verl.fully_async.mcp_async_main \
    --config-name=mcp_fully_async_harmony_tito_example \
    trainer.n_gpus_per_node=4 \
    rollout.n_gpus_per_node=4

# Multi-node (two pods)
bash integrations/verl/scripts/start_multinode_async.sh

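By comparison, Toolathlon only provides an OpenAI Gym-compatible interface and leaves the trainer to the user, while MCP-Universe ships verl, a set of training configs, multiple modes, and the TITO optimization, which is much deeper integration. The paper does not mention any of this; it is visible only in the repo, probably because this code was added after the paper was submitted.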

4.5 HuggingFace artifacts inventory

I searched both orgs, huggingface.co/Salesforce (184 models / 59 datasets / 10 spaces) and huggingface.co/SalesforceAIResearch (404), and also checked the paper page. Conclusion:

I found no MCP-Universe-specific model / dataset / space published on HF. The ground-truth data for all 231 tasks sits directly on GitHub as YAML+JSON under mcpuniverse/benchmark/configs/mcpuniverse/<domain>/ rather than on HF datasets. I also found no checkpoint on HF that Salesforce trained with the MCP-Universe RL stack.

This contrasts sharply with #22 TOUCAN (a 1.5M dataset + 27 checkpoints, all open on HF) and #23 EnvScaler (HF model + dataset). MCP-Universe takes the route of "open-source code stack, with the task data living only in the GitHub repo and no trained models released".

4.6 Self-hosting: the full command sequence from git clone to a first score

# 1. clone
git clone https://github.com/SalesforceAIResearch/MCP-Universe.git
cd MCP-Universe

# 2. Python environment
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -r dev-requirements.txt

# 3. System dependencies (Linux)
sudo apt-get install libpq-dev   # PostgreSQL header (optional storage)

# 4. API key
cp .env.example .env
# Edit .env and fill in:
#   OPENAI_API_KEY       (required, to run the GPT-5 evaluation)
#   SERP_API_KEY         (web search domain)
#   GOOGLE_MAPS_API_KEY  (location nav)
#   GITHUB_PERSONAL_ACCESS_TOKEN + GITHUB_PERSONAL_ACCOUNT_NAME (repo mgmt; ⚠ use a dedicated test account)
#   NOTION_API_KEY + NOTION_ROOT_PAGE
#   BLENDER_APP_PATH     (3D design, Blender 4.4.0)
#   MCPUniverse_DIR      (absolute path to the repo)

# 5. (3D tasks) Install Blender + VNC
bash setup_blender_and_vnc.sh

# 6. Run a single domain
export PYTHONPATH=.
python tests/benchmark/mcpuniverse/test_benchmark_web_search.py

# 7. Run all 6 domains
for d in location_navigation browser_automation financial_analysis \
         repository_management web_search 3d_design; do
    python tests/benchmark/mcpuniverse/test_benchmark_${d}.py
done

4.7 Open-source comparison with #19 MCP-Atlas / Toolathlon-Gym

Project	Stars (2026-05)	License	Task configs open	RL training code	HF artifacts	Scoring code	Submission
MCP-Universe	585	Apache-2.0	✓ all 231 tasks	✓ GRPO+verl	✗	✓ DSL evaluator	PR / Discord
MCP-Atlas (#19)	~150	—	500 public only (500 hold-out)	✗	✗	✓ Gemini judge	Live leaderboard, automatic
Toolathlon-Gym	~250	Apache-2.0	✓ all tasks	✓ (gym API)	Yes	✓	PR

Open-source ranking: MCP-Universe ≥ Toolathlon-Gym > MCP-Atlas. MCP-Universe leads on the RL stack dimension but trails on HF models/data, behind both Toolathlon-Gym and SFT-style work such as TOUCAN/EnvScaler.

5 · Experimental results and SOTA scores

The paper's §4 gives the full leaderboard (numbers are success rates, in %):

Model	Loc Nav	Repo Mgmt	Financial	3D Design	Browser	Web Search	Overall
GPT-5	33.33	30.30	67.50	52.63	35.90	45.45	43.72
Grok-4	28.89	12.12	40.00	26.32	41.03	41.82	33.33
Claude-4.0-Sonnet	22.22	12.12	55.00	26.32	38.46	21.82	29.44
DeepSeek-V3	11.11	6.06	30.00	26.32	12.82	7.27	14.29
GPT-OSS-120B	6.67	6.06	35.00	10.53	5.13	5.45	11.26
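The Overall column can be reconstructed from the per-domain rates and the task counts in §2.1, which is a useful consistency check on the table. The solved-task counts below are inferred by rounding rate × tasks and are therefore an assumption; the dictionary keys and function names are illustrative.

```python
TASKS = {"loc_nav": 45, "repo": 33, "financial": 40,
         "design_3d": 19, "browser": 39, "web_search": 55}

RATES = {  # per-domain success rates in %, copied from the leaderboard table
    "GPT-5": {"loc_nav": 33.33, "repo": 30.30, "financial": 67.50,
              "design_3d": 52.63, "browser": 35.90, "web_search": 45.45},
    "Grok-4": {"loc_nav": 28.89, "repo": 12.12, "financial": 40.00,
               "design_3d": 26.32, "browser": 41.03, "web_search": 41.82},
    "Claude-4.0-Sonnet": {"loc_nav": 22.22, "repo": 12.12, "financial": 55.00,
                          "design_3d": 26.32, "browser": 38.46, "web_search": 21.82},
    "DeepSeek-V3": {"loc_nav": 11.11, "repo": 6.06, "financial": 30.00,
                    "design_3d": 26.32, "browser": 12.82, "web_search": 7.27},
    "GPT-OSS-120B": {"loc_nav": 6.67, "repo": 6.06, "financial": 35.00,
                     "design_3d": 10.53, "browser": 5.13, "web_search": 5.45},
}

def solved_counts(rates):
    """Infer how many tasks were solved per domain (rounded, an assumption)."""
    return {d: round(rates[d] * TASKS[d] / 100) for d in TASKS}

def overall(rates):
    """Task-weighted overall success rate in %, rounded to two decimals."""
    return round(100 * sum(solved_counts(rates).values()) / sum(TASKS.values()), 2)

for model, rates in RATES.items():
    print(model, sum(solved_counts(rates).values()), overall(rates))
```

All five rows reproduce the reported Overall value, so Overall is a task-weighted average over the 231 tasks rather than a plain mean of the six domain scores.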

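5.1 A few observations

Even the SOTA is only 43.72%, far from "usable". This is the selling point repeated in media headlines ("even GPT-5 only reaches 43%").
Financial Analysis is the easiest domain (67.50% for GPT-5), because the yfinance API is clean, answers are numeric, and evaluators are easy to write.
Repo Mgmt is the hardest (only 30.30% even for GPT-5): real GitHub operations are multi-step, have large side effects, and tolerate little error.
Open-weight models lag far behind: DeepSeek-V3 at 14.29% and GPT-OSS-120B at 11.26%, roughly 3–4× below GPT-5.
On 3D Design, Claude and Grok both score 26.32% while GPT-5 jumps to 52.63%, which suggests a marked improvement in spatial reasoning for GPT-5 (though with only 19 tasks this is 5 vs. 10 tasks solved).

Frontier scores in the 30–45% range are a very good signal band from an RL training perspective: neither saturated (like BFCL v3, where GPT-4.5 is already at 70+) nor floored (like OSWorld, where base agents are below 10%), which leaves room for gradient under GRPO's trajectory-level reward. This in turn supports the practical usability of the §4.4 RL stack.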

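6 · Relationship to Salesforce's own models (xLAM)

Salesforce has its own flagship model family for agents / function calling, xLAM (Large Action Model), for example Salesforce/xLAM-8x22b-r. A natural question: has xLAM been evaluated on MCP-Universe, or trained with the MCP-Universe RL stack?

What I found:

The MCP-Universe paper has no xLAM baseline. Its leaderboard only covers frontier closed models plus DeepSeek and GPT-OSS.
The xLAM v2 announcement blog does not cite MCP-Universe either (on the 2025-08 timeline the two were released at roughly the same time).
The repo contains no xlam keyword anywhere, and the llm/ subpackage lists no xLAM adapter.
However, llm/sf_llm_express_gateway.py and sf_research_gateway.py reveal internal gateways, so xLAM evaluations are quite possibly running inside Salesforce without appearing on the public leaderboard.

This makes for a slightly awkward picture: Salesforce built a project billed as an "RL training + benchmarking framework", yet Salesforce's own LAM models are entirely absent from the public numbers. Possible explanations:

(1) xLAM is fine-tuned for function calling, its results on long-horizon tasks like MCP-Universe do not look good, and internal PR decided not to publish them.
(2) xLAM is waiting for a more stable version (xLAM v2 multi-turn) before running public evaluations.
(3) Strategically, MCP-Universe is positioned as "an open-source contribution to the outside world" rather than "a fine-tuning tool for xLAM".

I lean toward a combination of (1) and (3). MCP-Universe clearly wants to be cited as "the research community's standard MCP framework" rather than to work in the service of Einstein / xLAM.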

7 · Side-by-side comparison with similar MCP benchmarks / frameworks

	MCP-Universe	MCP-Atlas (#19)	Toolathlon	MCPMark	TOUCAN (#22, data perspective)
Positioning	framework	eval bench	eval + gym	eval bench	SFT data
Scale	11 servers / 231 tasks	36 servers / 220 tools / 1,000 tasks	~600 tasks	~500 tasks	495 servers / 1.5M trajectories
Scoring	DSL + real API (no LLM judge)	Gemini-2.5-Pro claims judge	rule-based	rule-based	GPT-OSS-120B Likert
RL training stack	✓ GRPO+verl	✗	✓ gym API	✗	(SFT not RL)
Subprojects	Deep Research / MCP+ / MCPMark runner	—	—	—	—
HF artifacts	✗	✗	Yes	Yes	✓ 27 ckpt + 1.5M data
Frontier model card citation	✗	✓ Claude 4.7	✗	✗	✗
GitHub Star (2026-05)	585	~150	~250	~100	—
Team	Salesforce AI	Scale AI + NUS	—	—	MIT-IBM + UW

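8 · Limitations / personal take

8.1 Limitations

Zero frontier citations. Salesforce invested a great deal of engineering, but the Anthropic / OpenAI / Google system cards do not cite MCP-Universe; the benchmark that does get cited is MCP-Atlas (77.3% in the Claude Opus 4.7 system card). This is a clear case where investment != influence.
231 tasks is on the small side. That is fewer than MCP-Atlas (1,000) or Toolathlon (~600), and 3D Design has only 19 tasks, which is statistically unstable. Salesforce has not scaled up the way #22 TOUCAN / #23 EnvScaler did.
Dynamic evaluators carry runtime risk. 57.1% of evaluators call real APIs at evaluation time, so scoring depends on API state: if Yahoo Finance / Google Maps change their API, rate-limit, or shut down, the whole evaluation stops working.
No models/data on HF. This lowers the odds of being cited by fine-tuning work (people are used to from datasets import load_dataset).
The RL stack has no published checkpoint to validate it. The paper does not report "what a model trained with the MCP-Universe RL stack scores on MCP-Universe", so the real benefit of the GRPO+verl integration has never been demonstrated in public numbers. Toolathlon-Gym and EnvScaler differ here: both report numbers for trained checkpoints.
xLAM is absent from Salesforce's own benchmark (§6).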
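The "statistically unstable" point about 19 tasks can be quantified with a binomial standard error. The sketch below treats each task as an independent pass/fail trial, which is a simplifying assumption; the solved counts (10 of 19 and 101 of 231 for GPT-5) are inferred from the reported percentages rather than stated in the paper.

```python
import math

def binomial_se(successes, n):
    """Standard error of a success rate estimated from n independent trials."""
    p = successes / n
    return math.sqrt(p * (1 - p) / n)

se_3d = binomial_se(10, 19)      # GPT-5 on 3D Design: 52.63% of 19 tasks
se_all = binomial_se(101, 231)   # GPT-5 overall: 43.72% of 231 tasks
points_per_task_3d = 100 / 19    # how far one task moves the 3D Design score

print(f"3D Design SE: {100 * se_3d:.1f} points")
print(f"Overall SE:   {100 * se_all:.1f} points")
print(f"One 3D task:  {points_per_task_3d:.2f} points")
```

On these assumptions the 3D Design score carries a standard error of roughly 11 points versus about 3 for the overall score, so single-domain gaps on the smallest domain should be read with caution.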

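8.2 Personal take

MCP-Universe is arguably the MCP project with the largest engineering investment in the 2025-08 to 2026-05 period, but its influence was captured by MCP-Atlas.

As I see it, Salesforce's real strategic intent here is to "own the MCP framework label" rather than to "win the leaderboard". Evidence: (a) the repo description says framework, not benchmark; (b) it keeps adding Deep Research / MCP+ / the MCPMark runner, turning itself into an "MCP meta-platform"; (c) MCP+ has already been spun out to mcp-plus.github.io with product-like packaging; (d) it publishes no models on HF, so it does not hand ammunition to the fine-tuning camp and keeps the framework as the entry point.

The fatal weakness of this playbook: there is no hook that makes a frontier model have to cite it. Atlas built a "you must use me to post a score" reverse-citation mechanism from 500 hold-out tasks + a live leaderboard + a Gemini judge; Universe has no such lock-in.

It is still the most useful option for individual researchers: if you want to do RL training for MCP agents, MCP-Universe is close to the only open-source choice where you can pip install and start training with one command (verl + Hybrid/Async + TITO + docker_pool, the full stack), whereas Toolathlon-Gym gives you only a gym interface and leaves the verl config to you. So: you do not have to cite the paper, but you can use the code.

8.3 Three concrete things you can do with MCP-Universe

Run MCP agentic RL training on a Qwen3-8B: git clone → pip install → prepare train.json (reuse the 231 benchmark tasks and adapt the schema) → python -m mcpuniverse.rl.integrations.verl.fully_async.mcp_async_main. The recommended setup is 4+4 GPUs (the minimum for this mode is 2); as a rough estimate, expect a first reward curve within about 3 days.
Add the MCP+ wrapper to your own Cursor / Claude Code to cut token cost: pip install mcpuniverse; mcp-build-plus --mcp-config ~/.cursor/mcp.json --llm-model gpt-5-mini --token-threshold 1000, then restart Cursor for a claimed 50–75% saving.
Run the W&D deep research baseline on BrowseComp / GAIA / HLE: use mcpuniverse/benchmark/configs/deepresearch/configs/{browsecomp,gaia,hle}/agent_wide_research_*_gpt5.yaml directly, change the LLM section to your own model, and get comparison numbers with one command.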

Primary sources

Paper: arXiv:2508.14704 / html v1
GitHub: SalesforceAIResearch/MCP-Universe
RL README (verbatim quoted): mcpuniverse/rl/README.md
MCP+ extension: extensions/mcpplus · mcp-plus.github.io
Deep Research README: configs/deepresearch/README.md · W&D paper arXiv 2602.07359
Website + Leaderboard: mcp-universe.github.io
Salesforce blog: salesforce.com/blog/mcp-universe
HF (paper page, no dataset/model): huggingface.co/papers/2508.14704
xLAM (Salesforce's own model, not evaluated in the paper): huggingface.co/Salesforce/xLAM-8x22b-r · xLAM v2 blog
Related reading: #19 MCP-Atlas · #20 SETA · #21 four-way MCP bench comparison · #22 TOUCAN · #23 EnvScaler · #24 Smithery