BFCL (Berkeley Function-Calling Leaderboard): the de facto standard for function calling

UC Berkeley · Gorilla LLM Team (Sky Computing Lab) · Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, Joseph E. Gonzalez · ICML 2025 (paper) · V1: 2024-02-26 → V2: 2024-08-19 → V3: 2024-09-19 → V4: 2025-07-17
Live leaderboard · GitHub: ShishirPatil/gorilla · HF Dataset · ICML 2025 paper · PyPI: bfcl-eval
Keywords: function-calling · tool use · AST eval · multi-turn · agentic · Web Search · Memory · Format Sensitivity · Apache-2.0

Quick-read card (TL;DR)

In one sentence: BFCL was launched in 2024-02 by the Gorilla LLM team (Patil, Mao, Stoica, Gonzalez) at UC Berkeley's Sky Computing Lab and describes itself as "the first comprehensive and executable function call evaluation". It is now the de facto standard for function calling: frontier model cards from Microsoft / Salesforce / Anthropic / Google / OpenAI, and nearly every small or mid-size tool-calling model (Qwen3 / GLM / Kimi / DeepSeek / xLAM / Hammer / Granite / Functionary), treat BFCL as the benchmark to optimize for and report.

Current version: V4 (2025-07-17), which adds agentic Web Search, Memory, and Format Sensitivity

Scoring tasks: ~4,951 (2,251 Live + 1,610 Non-Live + 1,000 Multi-Turn + ~90 agentic)

Current V4 SOTA: GLM-4.5 at 70.9% (an open-weight model ahead of GPT-5 at 59.2%)

License: Apache-2.0; repo, dataset, and evaluation scripts are all open source · 49,556 downloads/month

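Position: after the #21 MCP benchmark comparison and the #26 code-level deep dive, it is clear that the MCP benchmarks (MCP-Universe / Atlas / Toolathlon / MCPMark) are stateful multi-system benchmarks, whereas BFCL is a schema-and-trajectory benchmark; the two measure different dimensions. A high BFCL score (GLM-4.5 70.9%; Qwen3-32B 75.7% on V3) does not imply a high MCP-Atlas or MCP-Universe score (GPT-5 scores 43.72% on MCP-Universe; Claude Opus 4.7 scores 77.3% on MCP-Atlas). BFCL is still the mandatory first stop when training function-calling models at ≤40B. The ladder runs schema compliance → robustness on live data → multi-turn state tracking → agentic orchestration, and BFCL is its first rung. It is also the only schema-level benchmark with full automatic AST verification and no need for an LLM judge.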

§1 What BFCL is, and why it is the de facto function-calling standard

How the BFCL team positions itself (README, verbatim):

"We introduce the Berkeley Function Calling Leaderboard (BFCL), the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions. Unlike previous evaluations, BFCL accounts for various forms of function calls, diverse scenarios, and executability."
— repo README

The last sentence of the ICML 2025 paper's abstract ("The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models") is even more direct:

"Since its preview, BFCL has become the defacto standard for evaluating function-calls, and can be accessed at gorilla.cs.berkeley.edu/leaderboard.html."
— Patil et al., ICML 2025, abstract verbatim

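1.1 What "de facto standard" means in practice

The "de facto function-calling standard" claim can be checked at several levels:

Optimization and reporting target: nearly every tool-oriented model at ≤40B (xLAM, Hammer, Functionary, NexusRaven, Granite, ToolACE, BitAgent, Llama-xLAM-2, INTELLECT-3 …) explicitly uses BFCL as its primary reporting metric (see #27, the ≤40B small-model MCP landscape).
Paper results: #22 TOUCAN reports 70.45% on BFCL V3, #23 EnvScaler reports 41.88% on BFCL-MT, and #18 AWM reports 53.83 → 65.94 OOD on BFCL V3. All of them use BFCL as their core OOD test.
Frontier model cards: the GPT-5.5 system card explicitly lists BFCL as one of the sources of its 13K-task CoT-Control training set (see §9).
HF downloads: 49,556 downloads/month (2026-05, HF). MCPMark, Atlas, Universe, and Toolathlon combined do not reach that order of magnitude over the same period.
Fully Apache-2.0: the data, evaluation scripts, AST checker, and state checker are all open source, unlike the MCP family, where either the LLM judge is not reproducible or the OAuth setup is missing.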

1.2 Team & academic lineage

Group: UC Berkeley Sky Computing Lab (formerly RISE Lab) · Gorilla LLM team
Leads: Shishir G. Patil (PhD, Berkeley), Huanzhi Mao (Berkeley)
Senior faculty: Ion Stoica (Spark / Anyscale / vLLM) and Joseph E. Gonzalez (BAIR). These two names signal that BFCL is not a toy project, and vLLM's built-in BFCL tests come from the same group.
Sibling projects: Gorilla: Large Language Model Connected with Massive APIs (2023-05, arXiv:2305.15334) → OpenFunctions (BFCL V1 first shipped in 2024-02; the core evaluator in the repo is still named openfunctions_evaluation.py)
Email: huanzhimao@berkeley.edu (the official contact given at the end of the README)
Discord channel: #leaderboard at discord.gg/grXXvj9Whz

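§2 History: the V1 → V2 → V3 → V4 evolution

The four generations of BFCL follow a clear trajectory. Each one addresses a problem exposed by its predecessor: the evaluation proved too easy, models overfit, or something that should have been tested was not. The SVG below is a timeline visualizing the capabilities added in each generation.

[Figure: BFCL four-generation timeline. Dates and numbers are taken verbatim from CHANGELOG.md and the HF dataset card; the V1 date comes from the HF dataset card's "Original Release: 02/26/2024".]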

2.1 Per-version notes (verbatim)

V1 — Simple/Parallel/Multiple Function Call eval with AST (2024-02-26)

The HF dataset card gives "Original Release: 02/26/2024". V1 introduced the core innovation: matching arguments and types through the AST (abstract syntax tree), rather than comparing model output as a string or actually executing it. Original categories:

AST categories: Simple Python 400 · Java 100 · JavaScript 50 · Multiple Python 200 · Parallel Python 200 · Parallel Multiple Python 200
Executable categories: Simple 100 · REST 70 · Multiple 50 · Parallel 50 · Parallel Multiple 40
Relevance/Irrelevance Detection: 240
Total: ~1,700 entries

V2 — Enterprise and OSS-contributed Live Data (2024-08-19)

CHANGELOG verbatim:

"[August 19, 2024] #580: Introduce BFCL V2 Live dataset, featuring user-contributed live prompts and function docs. To read more about the composition and construction of this dataset, please refer to our blog. All CLI commands have been updated to support the new dataset."

V2 Live distribution (verbatim from the HF dataset card):

Simple 258 · Multiple 1,053 · Parallel 16 · Parallel Multiple 24 · Irrelevance Detection 882 · Relevance Detection 18
Total: 2,251 question-function-answer pairs (note a discrepancy: the HF dataset card says 882 where the early blog said 875; we use the latest number from HF)

V3 — Multi-Turn & Multi-Step Function Call Evaluation (2024-09-19)

CHANGELOG verbatim:

"[Sept 19, 2024] #644: BFCL V3 release: Introduce new multi-turn dataset and state-based evaluation metric."

V3's 1,000 multi-turn entries (official blog):

Base Multi-Turn 200 — Foundational interactions with all necessary information provided across turns
Augmented Multi-Turn 800:
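- Missing Parameters 200: the model must proactively ask a follow-up question
- Missing Functions 200: the model must recognize that it does not have the required tool and decline
- Long-Context 200: stress test on multi-turn context length
- Composite 200: a combination of the three augmentations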

V4 — Agentic Web Search + Memory + Format Sensitivity (2025-07-17)

This is the current version. CHANGELOG #1019 describes the V4 changes (verbatim):

"[Jul 17, 2025] #1019: BFCL V4 release:

New agentic domain — Introduces the agentic domain with two categories: Web Search and Memory Management.

Revised overall-accuracy formula — As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks."

New weight table (the most important change in V4):

Segment	Old %	New %
Live (V2, user-contributed)	33	10
Non-Live (V1, academic synthetic)	33	10
Irrelevance Detection	0	10
Multi-Turn (V3)	33	30
Agentic (new in V4)	0	40

Beyond that, V4 also:

Retired the Executable categories (rest / exec_simple / exec_parallel / exec_multiple / exec_parallel_multiple), per #943 on 2025-04-09
Renamed categories: simple → simple_python, java → simple_java, javascript → simple_javascript
Changed the directory layout to a two-level hierarchy: result/<model>/<general_category>/<category>.json, where general_category ∈ {non_live, live, multi_turn, agentic, format_sensitivity}
Explicitly listed the supported frontier models: claude-opus-4-1-20250805 · gpt-5-2025-08-07 · gpt-5-mini-2025-08-07 · gpt-5-nano-2025-08-07 · Qwen/Qwen3-30B-A3B-Instruct-2507 · Qwen/Qwen3-235B-A22B-Instruct-2507 · Qwen/Qwen3-4B-Instruct-2507

Details of the three V4 parts

V4 Part 1, Agentic Web Search (blog 15): released 2025-07-17. Introduces multihop QA tasks: "100 human-crafted multihop questions spanning various domains". Running it requires SerpAPI (the README states: "For the web_search test category, we use the SerpAPI service to perform web search. You need to sign up for an API key…"). The sub-categories are web_search_base and web_search_no_snippet (the latter forces the model to fetch and read the webpage rather than rely on search snippets).

V4 Part 2, Agentic Memory Management (blog 16): released the same day, 150+ questions. Covers 5 domains: College Student Advising / Customer Support / Personal To-Do List / Healthcare Patient / Finance Managing Director. Three memory backends (inspired by MemGPT / Mem0):

memory_kv: key-value store · API: core_memory_* and archival_memory_* (add / remove / replace / clear / retrieve / key_search / list_keys; see §3.2.7.A)
memory_vector: vector database · same core/archival split, but entries are plain text with no key (see §3.2.7.B)
memory_rec_sum: recursive summarization · API: memory_append() · memory_update() · memory_replace() · memory_clear() · memory_retrieve()

V4 Part 3, Agentic Format Sensitivity (blog 17): excluded from the overall score and used only as a diagnostic, to see whether small changes to the prompt template make prompting-mode models break down.

§3 Task composition in detail

BFCL V4 has 25 individual test categories in total (extracted verbatim from TEST_CATEGORIES.md). Grouped by general_category:

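group	category (repo ID)	description	source
non_live (V1, academic synthetic)	`simple_python`	Simple single-function Python call	V1
	`simple_java`	Single-function Java call	V1
	`simple_javascript`	Single-function JavaScript call	V1
	`multiple`	Pick one function to call out of several candidates	V1
	`parallel`	Same function called several times in parallel	V1
	`parallel_multiple`	Several functions called several times in parallel	V1
irrelevance	`irrelevance`	All function docs are irrelevant → the model must decline to call	V1
live (V2, user-contributed)	`live_simple`	Real user-contributed simple calls	V2
	`live_multiple`	User-contributed, multiple functions	V2
	`live_parallel`	User-contributed, parallel	V2
	`live_parallel_multiple`	User-contributed, multiple functions in parallel	V2
	`live_irrelevance`	User-contributed "should decline" scenarios	V2
	`live_relevance`	User-contributed "should call" scenarios	V2
multi_turn (V3, stateful)	`multi_turn_base`	200: basic multi-turn, all information provided	V3
	`multi_turn_miss_func`	200: missing function → should decline	V3
	`multi_turn_miss_param`	200: missing parameter → should ask a follow-up	V3
	`multi_turn_long_context`	200: long-context stress	V3
agentic (V4)	`memory_kv`	read/write/search on the KV store backend	V4
	`memory_vector`	Vector-store backend	V4
	`memory_rec_sum`	Recursive summarization backend	V4
	`web_search_base`	SerpAPI multihop QA	V4
	`web_search_no_snippet`	Forces fetch + read of the webpage	V4
diag (not scored)	`format_sensitivity`	System prompt format perturbation diagnostic	V4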

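The augmented subsets also include composite (a combination of the three multi-turn augmentations); it has no row in the table above, and blog 13 files it under the augmented multi-turn class.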

Category descriptions, verbatim (README excerpt)

- simple_python: Simple Python function calls. Part of the "non-live simple"
  category on the leaderboard.
- multiple: Multiple function calls in sequence.
- parallel: Multiple function calls in parallel.
- parallel_multiple: Multiple function calls in parallel and in sequence.
- irrelevance: Function calls with irrelevant function documentation.
- multi_turn_base: Base entries for multi-turn function calls.
- multi_turn_miss_func: Multi-turn function calls with missing function.
- multi_turn_miss_param: Multi-turn function calls with missing parameter.
- multi_turn_long_context: Multi-turn function calls with long context.
- memory_kv: Tests reading from and writing to a key-value memory backend.
- memory_vector: Tests reading from and writing to a vector-database memory backend.
- memory_rec_sum: Tests reading from and writing to a recursive-summarization memory backend.
- web_search_base: Base entries for web-search calls.
- web_search_no_snippet: Web-search calls where search-engine snippets are withheld,
  forcing the model to fetch and read webpages.
- format_sensitivity: Various system prompt formats to test the format sensitivity
  of the model. (Only works for prompting mode models.)

3.1 A few prompt template fragments

Simple Python (V1): the ground truth for a typical simple_python question (adapted from simple_python_x in the repo):

{
  "id": "simple_python_3",
  "question": [[{
    "role":"user",
    "content":"Calculate the area of a triangle with base 10 and height 5."
  }]],
  "function": [{
    "name": "calculate_triangle_area",
    "description": "Calculate the area of a triangle given its base and height.",
    "parameters": {"type":"dict",
      "properties":{
        "base":{"type":"integer","description":"Base length in units"},
        "height":{"type":"integer","description":"Height in units"},
        "unit":{"type":"string","description":"Unit (cm, m, etc.)","default":"units"}
      },
      "required":["base","height"]}
  }],
  "possible_answer": [{
    "calculate_triangle_area":{
      "base":[10], "height":[5], "unit":["units","cm","m"]
    }
  }]
}

Note that possible_answer lists several allowed values for unit. The BFCL AST check accepts multiple ground truths (set membership), which is the core difference between it and string-match evaluation.
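The set-membership idea can be sketched in a few lines of Python. This is a minimal illustration, not BFCL's checker: the helper name matches_possible_answer is hypothetical, type checking is left out, and accepting an omitted non-required parameter is an assumption made for the example.

```python
def matches_possible_answer(call_args, possible_answer, required):
    """Check one call's arguments against per-parameter lists of allowed values.

    Hypothetical helper for illustration; not part of the BFCL code base.
    """
    # Every required parameter must be supplied.
    if any(name not in call_args for name in required):
        return False
    for name, value in call_args.items():
        allowed = possible_answer.get(name)
        # Reject parameters the ground truth does not know about,
        # and values outside the allowed set.
        if allowed is None or value not in allowed:
            return False
    # Assumption: a non-required parameter may simply be left out.
    return True


possible = {"base": [10], "height": [5], "unit": ["units", "cm", "m"]}
required = ["base", "height"]

ok_cm = matches_possible_answer({"base": 10, "height": 5, "unit": "cm"}, possible, required)
ok_omitted = matches_possible_answer({"base": 10, "height": 5}, possible, required)
wrong_value = matches_possible_answer({"base": 10, "height": 6}, possible, required)
print(ok_cm, ok_omitted, wrong_value)
```

A plain string comparison would reject unit="cm" whenever the reference answer happened to be written with unit="m"; checking membership per parameter avoids that.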

Multi-Turn (V3): a multi_turn_base question looks like this (illustrative rewrite):

{
  "id":"multi_turn_base_15",
  "initial_config":{
    "TravelAPI":{...}, "MessageAPI":{...}
  },
  "involved_classes":["TravelAPI","MessageAPI"],
  "question":[
    [{"role":"user","content":"Book me a flight from SFO to JFK on May 20."}],
    [{"role":"user","content":"Wait, change it to May 22 and message my partner."}]
  ],
  "ground_truth":[
    ["TravelAPI.book_flight(origin='SFO', dest='JFK', date='2026-05-20')"],
    ["TravelAPI.modify_booking(booking_id='...', new_date='2026-05-22')",
     "MessageAPI.send(recipient='partner', body='Flight changed to May 22')"]
  ]
}

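3.2 Examples of each V4 task type, extracted verbatim from the repo

About this section: all samples below are the first entry (or first few entries) extracted directly from the JSON files under raw.githubusercontent.com/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/bfcl_eval/data/, fetched 2026-05. The companion possible_answer/*.json files give the ground truth, multi_turn_func_doc/*.json gives the tool signatures, and eval_checker/multi_turn_eval/func_source_code/*.py gives the backend implementations. After reading this section you should be able to say, for each task type, what the model sees, what it must output, and how it is scored.

3.2.1 non-live · `simple_python`: single function + multiple ground truths

§3.1 above already shows this case; here we add a verbatim parallel example (the same function called several times in parallel):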

// BFCL_v4_parallel.json (first entry)
{"id": "parallel_0",
 "question": [[{"role": "user",
   "content": "Play songs from the artists Taylor Swift and Maroon 5,
                with a play time of 20 minutes and 15 minutes respectively, on Spotify."}]],
 "function": [{"name": "spotify.play",
   "description": "Play specific tracks from a given artist for a specific time duration.",
   "parameters": {"type": "dict",
     "properties": {
       "artist": {"type": "string", "description": "..."},
       "duration": {"type": "integer", "description": "..."}},
     "required": ["artist", "duration"]}}]}

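Difficulty: the model must emit two separate spotify.play calls rather than one call with list arguments; the AST check strictly distinguishes parallel calls from list arguments.

3.2.2 non-live · `irrelevance`: the model must not call a tool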

// BFCL_v4_irrelevance.json (first entry)
{"id": "irrelevance_0",
 "question": [[{"role": "user",
   "content": "Calculate the area of a triangle given the base is 10 meters and height is 5 meters."}]],
 "function": [{"name": "determine_body_mass_index",
   "description": "Calculate body mass index given weight and height.",
   "parameters": {"properties": {
     "weight": {"type": "float"}, "height": {"type": "float"}},
     "required": ["weight", "height"]}}]}

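Difficulty: the tool name and description contain bait words such as height. In the BMI tool, height is a person's body height; in the triangle question, it is a geometric height. A model that takes the keyword bait scores 0. The Irrelevance column on the leaderboard reports exactly this kind of refusal accuracy.

3.2.3 multi_turn · `multi_turn_base`: four turns of file operations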

// BFCL_v4_multi_turn_base.json (first entry)
{"id": "multi_turn_base_0",
 "question": [
   [{"role":"user","content":"Move 'final_report.pdf' within document directory
                              to 'temp' directory in document. Make sure to create the directory"}],
   [{"role":"user","content":"Perform a detailed search using grep to identify sections
                              in the file pertaining to 'budget analysis'."}],
   [{"role":"user","content":"Upon identifying the requisite 'budget analysis' content,
                              sort the 'final_report.pdf' by line ..."}],
   [{"role":"user","content":"Move 'previous_report.pdf' in document directory to temp ...
                              proceed to juxtapose it with 'previous_report.pdf' ..."}]],
 "initial_config": {
   "GorillaFileSystem": {"root":{"workspace":{"type":"directory",
     "contents":{"document":{"type":"directory",
       "contents":{"final_report.pdf":{"type":"file",
         "content":"Year2024 This is the final report content including budget analysis ..."},
         "previous_report.pdf":{...}}},
       "archive":{"type":"directory","contents":{}}}}}},
   "TwitterAPI": {"tweet_counter":3, "tweets":{...},
     "username":"analyst_pro", "password":"Kj8#mP9$vL2"}},
 "path": ["GorillaFileSystem.find","GorillaFileSystem.mv","GorillaFileSystem.grep",
          "GorillaFileSystem.sort","GorillaFileSystem.diff","TwitterAPI.post_tweet"],
 "involved_classes": ["TwitterAPI","GorillaFileSystem"],
 "excluded_function": ["cp"]}

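Key points:

initial_config is the initial state of the Python classes (including a plaintext password!), carried by 14+ simulated backend classes such as GorillaFileSystem and TwitterAPI.
path is the list of functions that must be traversed (verified by the state-based eval); excluded_function is the deny list (for example, cp is not allowed, so only mv can be used).
During evaluation, after the calls of each turn finish, the checker compares the simulator's actual state with the ground-truth state via state_diff; see §4.3 for details.

3.2.4 multi_turn · `multi_turn_miss_func`: an empty turn marks a deliberately withheld tool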

// BFCL_v4_multi_turn_miss_func.json (first entry)
{"id": "multi_turn_miss_func_0",
 "question": [
   [{"role":"user","content":"Move 'final_report.pdf' ..."}],
   [{"role":"user","content":"Perform a detailed search using grep ..."}],
   [{"role":"user","content":"... sort the 'final_report.pdf' by line ..."}],
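   [],   // ← empty turn: no user message; its index (3) matches the key in "missed_function"
   [{"role":"user","content":"Move 'previous_report.pdf' ... juxtapose ..."}]],
 ...,
 "missed_function": {"3": ["sort"]},  // ← sort is held out of the tool list until turn 3
 "excluded_function": ["cp"]}

Difficulty: the sort request arrives in turn 2 (0-indexed), while sort is still held out of the available tool list, so the model should reply that it has no sort tool rather than hallucinate a call. The empty [] at index 3 carries no user message: it is the turn keyed by "3" in missed_function, where the harness adds the held-out function back so that the model can complete the pending request.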

3.2.5 multi_turn · `multi_turn_miss_param`: the model must ask a follow-up question

// BFCL_v4_multi_turn_miss_param.json (first entry; note the second-to-last turn)
{"id": "multi_turn_miss_param_0",
 "question": [
   ...,
   [{"role":"user","content":"Move one of the file in document directory to temp ...
                              proceed to juxtapose it with 'previous_report.pdf' ..."}],
   //  ↑ the user says "one of the file" without specifying which one
   [{"role":"user","content":"The specific file is previous_report.pdf."}]],
   //  ↑ the model should ask for clarification; the user then answers in the next turn
 ...}

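Difficulty: if the model guesses a file name and calls mv directly, it fails; the correct behavior is to ask for clarification (reply to the user without issuing a function call). V3 added this category specifically to detect "agreeable hallucination".

3.2.6 agentic · `web_search` (new in V4): multi-hop QA + SerpAPI

// BFCL_v4_web_search.json (first entry)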
{"id": "web_search_0",
 "question": [[{"role":"user","content":
   "Some countries are known for producing luxury goods, including the world's most
    expensive tea. In April 2025, who is the richest billionaire (according to Forbes)
    from the country that produces the most expensive tea?"}]],
 "involved_classes": ["WebSearchAPI"]}

// possible_answer/BFCL_v4_web_search.json (first entry; note that source is a 3-hop reasoning chain)
{"id": "web_search_0",
 "ground_truth": ["Zhang Yiming"],   // ← but per Wikipedia the 2025 number one is actually Zhong Shanshan
 "source": [
   {"subquestion":"Most expensive tea in the world",
    "answer":"DaHong Pao", "source":"https://camellios.com/..."},
   {"subquestion":"Country that produces DaHong Pao",
    "answer":"China", "source":"https://en.wikipedia.org/wiki/Da_Hong_Pao"},
   {"subquestion":"Who is the richest billionaire in China",
    "answer":"Zhong Shanshan", "source":"https://www.forbes.com/..."}],
 "num_hops": 3}

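Exposed tool signatures (multi_turn_func_doc/web_search.json):

search_engine_query(keywords, max_results=10, region="wt-wt"): goes through SerpAPI
fetch_url_content(url, mode="raw"|"markdown"|"truncate"): fetches and parses a webpage

Difficulties & data observations:

The ground_truth for this question is "Zhang Yiming", yet the final answer in the source chain is "Zhong Shanshan". This is a real inconsistency in the leaderboard data, and it shows that answers to such questions drift over time (Zhong Shanshan in 2024, then another change in the 2025 Forbes list). BFCL therefore anchors web_search questions in time with phrases like "April 2025", but the human-labeled ground_truth can still lag behind.
The web_search_no_snippet sub-category strips the snippet field from the SerpAPI results, forcing the model to call fetch_url_content and actually open the page; it cannot piece an answer together from summaries.
Running this category requires a paid SerpAPI key; this is the first time BFCL has brought non-free infrastructure into the leaderboard.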

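3.2.7 agentic · `memory` (the largest V4 addition): full breakdown

This section is the focus of this update.

The 2-stage architecture of memory tasks (key point):

Stage 1, the prereq stage (not scored; pre-run by the evaluator): for each scenario, memory_prereq_conversation/memory_<domain>.json holds a user monologue of 6 to 30 turns (5 domains: customer / student / patient / finance / note-taking). The evaluator feeds this monologue to the model and lets the model store information through the memory API as it sees fit. The memory dump from this stage is persisted to snapshot_folder/<test_id>.json.
Stage 2, the scored stage (this is what the leaderboard reports): the memory persisted in Stage 1 is reloaded, and the model receives a single-turn question (the question field in BFCL_v4_memory.json), such as "What is my first name?" / "How old am I?" / "What kind of latte do I occasionally like?". The model must look the answer up through the backend's retrieval tools (for example core_memory_retrieve or memory_retrieve) and give it in a final natural-language reply. The ground truth is a set of strings; for example, either element of ["35","thirty five"] is accepted.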

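The 5 domains (verbatim from data folder):

domain key	scenario	prereq monologue topic
`customer`	Customer Support	Customer Michael complains about a damaged espresso machine, accessories, shipping, etc. (30 questions)
`healthcare`	Healthcare Patient	Patient medical history / medication / emergency-room experience (30 questions)
`finance`	Finance Managing Director	A finance lead's portfolio operations / client records
`notetaker`	Personal To-Do List	Daily tasks / plans
`student`	College Student Advising	Course selection / GPA / advisor communication

Verbatim sample: Stage 2 questions (first 5 entries of BFCL_v4_memory.json):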

{"id":"memory_0-customer-0","question":[[{"role":"user","content":"What is my first name?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}
{"id":"memory_1-customer-1","question":[[{"role":"user","content":"How old am I?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}
{"id":"memory_2-customer-2","question":[[{"role":"user","content":"Where do I live?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}
{"id":"memory_3-customer-3","question":[[{"role":"user","content":"What kind of latte do I occasionally like?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}
{"id":"memory_4-customer-4","question":[[{"role":"user","content":"How many square feet is my kitchen counter?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}

The corresponding ground truth and source passages (possible_answer/BFCL_v4_memory.json):

{"id":"memory_0-customer-0","ground_truth":["Michael"],
 "source":"My name is Michael, and this is my first time interacting with your company..."}
{"id":"memory_1-customer-1","ground_truth":["35","thirty five"],
 "source":"I'm 35 years old, live in Seattle..."}
{"id":"memory_2-customer-2","ground_truth":["Seattle"],
 "source":"I'm 35 years old, live in Seattle..."}
{"id":"memory_3-customer-3","ground_truth":["strawberry matcha"],
 "source":"...I occasionally like a strawberry matcha latte..."}
{"id":"memory_4-customer-4","ground_truth":["38","thirty eight"],
 "source":"...my counter is only 38 square feet."}

The Stage 1 prereq monologue looks like this (excerpt from memory_prereq_conversation/memory_customer.json):

{"id":"memory_prereq_0-customer-0","topic":"First-Time Inquiry About a Product",
 "question": [
   [{"role":"user","content":"Hey there! Thanks for getting back to me ...
                              My name is Michael, and this is my first time
                              interacting with your company in any way..."}],
   [{"role":"user","content":"I'm 35 years old, live in Seattle, and am
                              pretty serious about both my work and my hobbies.
                              I work as a freelance graphic designer..."}],
   ... // long monologue, 30+ turns
 ]}

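Note: every Stage 1 turn is a user → assistant exchange. While reading, the assistant decides whether to call core_memory_add / archival_memory_add to record key facts such as "Michael / 35 / Seattle / strawberry matcha / 38 sqft". What the model chooses to store or skip in Stage 1 directly determines what it can retrieve in Stage 2.

3.2.7.A memory_kv sub-category: key-value backend

Source: class MemoryAPI_kv in eval_checker/multi_turn_eval/func_source_code/memory_kv.py. The design is inspired by MemGPT: a two-tier structure with core memory (short-term, capacity 7 entries, each ≤300 characters) and archival memory (long-term, capacity 50 entries, each ≤2000 characters).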

# memory_kv.py verbatim
MAX_CORE_MEMORY_SIZE = 7
MAX_CORE_MEMORY_ENTRY_LENGTH = 300
MAX_ARCHIVAL_MEMORY_SIZE = 50
MAX_ARCHIVAL_MEMORY_ENTRY_LENGTH = 2000

class MemoryAPI_kv(MemoryAPI):
    """A class that provides APIs to manage short-term and long-term
       memory data in a key-value format."""

    def core_memory_add(self, key: str, value: str) -> Dict[str, str]:
        """Add a key-value pair to the short-term memory.
           Keys must be snake_case and cannot contain spaces."""
        # checks capacity / length / key format / duplicates
        ...

    def core_memory_remove(self, key: str) -> Dict[str, str]: ...
    def core_memory_replace(self, key: str, value: str) -> ...
    def core_memory_clear(self) -> ...
    # plus a matching set of archival_memory_* methods
    def archival_memory_key_search(self, query: str, k: int = 5):
        """Search for key names ... using BM25+ algorithm."""

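The 16 tool signatures exposed to the model (multi_turn_func_doc/memory_kv.json): core_memory_add / remove / replace / clear / retrieve / key_search / list_keys, plus a matching set of archival_memory_*. Note that keys must be snake_case; this is a format constraint that V4 adds deliberately to force consistent naming.

Difficulty: core memory holds only 7 entries, so the model must rank importance (deciding whether "Michael's name" goes into core or archival), or it will overflow easily. Retrieval is BM25+ bag-of-words, so synonym recall is weak (if the key written is customer_first_name, a query for "name" hits, but a query for "user" may not).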
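To make the capacity and key-format constraints concrete, here is a toy sketch of a core-memory add operation. It reuses the two limits quoted from memory_kv.py; the class ToyCoreMemory, its error messages, and the exact snake_case regex are assumptions for illustration and are not taken from the BFCL source.

```python
import re

MAX_CORE_MEMORY_SIZE = 7            # limit quoted from memory_kv.py
MAX_CORE_MEMORY_ENTRY_LENGTH = 300  # limit quoted from memory_kv.py

# Assumed definition of snake_case: lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


class ToyCoreMemory:
    """Hypothetical stand-in for the short-term tier; not the BFCL class."""

    def __init__(self):
        self.entries = {}

    def add(self, key, value):
        if not SNAKE_CASE.match(key):
            return {"error": "key must be snake_case"}
        if key in self.entries:
            return {"error": "key already exists"}
        if len(self.entries) >= MAX_CORE_MEMORY_SIZE:
            return {"error": "core memory is full"}
        if len(value) > MAX_CORE_MEMORY_ENTRY_LENGTH:
            return {"error": "value is too long"}
        self.entries[key] = value
        return {"status": "added"}


memory = ToyCoreMemory()
results = [memory.add(f"fact_{i}", "some value") for i in range(8)]
bad_key = ToyCoreMemory().add("First Name", "Michael")
print(results[-1], bad_key)
```

The eighth add fails because the tier is full, which is the pressure that forces a model to decide what belongs in core memory and what should go to the archival tier.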

3.2.7.B memory_vector sub-category: vector-store backend

Source: class MemoryAPI_vector in memory_vector.py, which internally uses sentence-transformers all-MiniLM-L6-v2 + FAISS (verbatim below).

# memory_vector.py verbatim
ENCODER = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
ENCODER_DIM = ENCODER.get_sentence_embedding_dimension()

class MemoryAPI_vector(MemoryAPI):
    def __init__(self):
        self.core_memory = VectorStore(max_size=7, max_entry_length=300)
        self.archival_memory = VectorStore(max_size=50, max_entry_length=2000)

    def core_memory_add(self, text: str) -> dict[str, str]:
        """Add a new entry to the core memory.
           Returns: id (int): The ID of the added entry."""
        return self.core_memory.add(text)

    def core_memory_retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve the most similar entries from the core memory."""
        # FAISS L2 nearest-neighbor search

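Difference vs kv: no key is needed. core_memory_add(text) takes only text; the system encodes it automatically and returns a numeric id. Retrieval relies on cosine/L2 similarity rather than keywords.

Difficulty: the model cannot give itself hints through keys and depends entirely on embedding recall. If Stage 1 batches several facts into one entry (e.g. "name=Michael, age=35, city=Seattle"), any related Stage 2 query can pull it back, but the top-k results are noisy; conversely, if each entry stores a single fact, Stage 2 may fail to recall it within the top 5. This is the classic RAG chunk-size tradeoff.

3.2.7.C memory_rec_sum sub-category: recursive summarization backend

Source: class MemoryAPI_rec_sum in memory_rec_sum.py. This one is the most unusual: memory is a single plain string with a capacity of 10000 characters, and there are only 5 tools.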

# memory_rec_sum.py verbatim
MAX_MEMORY_ENTRY_LENGTH = 10000  # 10k characters

class MemoryAPI_rec_sum(MemoryAPI):
    def __init__(self):
        self.memory = ""   # ← a single string, not a dict and not a list

    def memory_append(self, text: str) -> Dict[str, str]:
        """Append a new text to the end of the memory."""
        # reports an error on length overflow, forcing the model to summarize

    def memory_update(self, text: str) -> Dict[str, str]:
        """Update the memory with new text.
           This will replace the existing memory content."""

    def memory_replace(self, old_text, new_text) -> ...
    def memory_clear(self) -> ...
    def memory_retrieve(self) -> ...  # returns the whole string

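Core design intent: this is a Claude-style "rolling summary" memory. Every N turns the model must take the existing memory out, compress it into a shorter summary, and write it back with memory_update; the "recursive" part is precisely this repeated retrieve → summarize → overwrite loop. memory_replace(old, new) is a fine-grained revision, comparable to sed.

Difficulty: there is no index, so a Stage 2 query can only fetch the entire 10k-character string and put it into the context for the LLM to search on its own. This is why this sub-category usually has the highest token cost (every retrieve pours the full memory back into the prompt). On the other hand, recall is in principle 100% (the whole string is returned), so rec_sum often scores on the high side on the leaderboard while having the worst latency and cost. This corresponds to the RecSum / KV / Vector split mentioned in the #29 EnvTuning paper and contrasts directly with that paper's "RL-based memory management".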
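The retrieve → summarize → overwrite loop can be sketched as follows. This is a toy model of the idea under stated assumptions: ToyRecSumMemory and remember are hypothetical names, the error dictionaries are invented, and truncate_summary is a placeholder for the summary that the model itself would write in a real run.

```python
MAX_MEMORY_ENTRY_LENGTH = 10000  # limit quoted from memory_rec_sum.py


class ToyRecSumMemory:
    """Toy single-string memory; a hypothetical stand-in, not the BFCL class."""

    def __init__(self):
        self.memory = ""

    def append(self, text):
        if len(self.memory) + len(text) > MAX_MEMORY_ENTRY_LENGTH:
            return {"error": "memory would exceed the length limit"}
        self.memory += text
        return {"status": "appended"}

    def update(self, text):
        if len(text) > MAX_MEMORY_ENTRY_LENGTH:
            return {"error": "text exceeds the length limit"}
        self.memory = text
        return {"status": "updated"}

    def retrieve(self):
        return self.memory


def remember(memory, text, summarize):
    """Append text; on overflow, retrieve -> summarize -> overwrite, then retry."""
    result = memory.append(text)
    if "error" in result:
        memory.update(summarize(memory.retrieve()))
        result = memory.append(text)
    return result


def truncate_summary(text):
    # Placeholder only: in the benchmark the model writes the summary itself.
    return text[:2000]


memory = ToyRecSumMemory()
for chunk in ("a" * 4000, "b" * 4000, "c" * 4000):
    remember(memory, chunk, truncate_summary)
print(len(memory.retrieve()))
```

In this run the third chunk triggers the overflow path, and everything the placeholder summary dropped (all of the "b" text) is gone for good. That is the practical limit on the "100% recall" argument: only facts that survive each summarization round can be returned in Stage 2.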

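3.2.7.D Scoring of memory tasks

Memory tasks use the same state-based checker as multi-turn (multi_turn_checker.py), with two extra steps:

The Stage 1 snapshot must be generated successfully: if the Stage 1 model never calls *_add, Stage 2 receives an empty memory and almost certainly scores 0.
The final Stage 2 reply must contain a ground_truth substring: this is one of the few categories that uses string-contains rather than AST (because the final answer is natural language), but ground_truth is a set, so a hit on any element of, say, ["35","thirty five"] is enough.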
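The string-contains rule can be written down in a couple of lines. This is a sketch of the idea only: the function name memory_answer_correct is hypothetical, and the case-insensitive comparison is an assumption that the text above does not confirm.

```python
def memory_answer_correct(response: str, ground_truth: list[str]) -> bool:
    """Return True if any accepted answer appears as a substring of the reply.

    Hypothetical helper; lowercasing both sides is an assumption.
    """
    reply = response.lower()
    return any(answer.lower() in reply for answer in ground_truth)


accepted = ["35", "thirty five"]
digit_reply = memory_answer_correct("You told me you are 35 years old.", accepted)
word_reply = memory_answer_correct("You are thirty five.", accepted)
no_answer = memory_answer_correct("I could not find your age in memory.", accepted)
print(digit_reply, word_reply, no_answer)
```

Substring matching is lenient by construction: under this sketch a reply such as "135" would also count as a hit for "35", which is one reason the accepted answers are kept short and specific.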

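3.2.8 diag · `format_sensitivity` (new in V4, excluded from the overall score)

The same multi_turn_base tasks are used, but the system prompt is wrapped in a variety of format perturbation templates (JSON vs YAML vs XML vs Markdown, with or without emoji, different casing, and so on) to see how stable the model's answers are across templates. The result appears in the leaderboard's FmtΔ column (format gap). Claude Opus 4.5 in prompt mode drops 44 pp in this column (see §6.2), which shows how sensitive prompting-mode models can be to the system prompt template. FC-mode rows show N/A in this column because the diagnostic only applies to prompting-mode models.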

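§4 Evaluation methods

BFCL supports two core evaluation modes, and from V3 onward adds a dual state-based + response-based evaluation for multi-turn.

4.1 AST evaluation (since V1, used for all single-turn categories)

The core dispatch of the AST checker (verbatim from bfcl_eval/eval_checker/ast_eval/ast_checker.py):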

def ast_checker(
    func_description,
    model_output,
    possible_answer,
    language: Language,
    test_category: str,
    model_name: str,
):
    if "parallel" in test_category:
        return parallel_function_checker_no_order(
            func_description, model_output, possible_answer, language, model_name
        )
    elif "multiple" in test_category:
        return multiple_function_checker(
            func_description, model_output, possible_answer, language, model_name
        )
    else:
        return simple_function_checker(
            func_description[0], model_output[0], possible_answer[0], language, model_name
        )

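All three sub-checkers parse the model output into a real AST (Python: ast.parse; Java/JS: their own tree-sitter-style parsers) and then match layer by layer:

Function name match: was the right function called?
Parameter name match: were all required parameters supplied?
Type match: an int cannot be passed as a str (strict on typing). One exception: int → float is relaxed because Python converts automatically (added in CHANGELOG 2024-06-07 #407).
Value match: a value counts as correct if it falls within the set of legal values in possible_answer (multiple ground truths allowed).
String standardization (introduced 2024-04-03 #309): strips whitespace and a subset of punctuation (,./-_*^) to make AST matching more robust.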
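The layers above can be illustrated with Python's standard ast module. The sketch below is not BFCL's implementation: parse_call, type_ok, standardize, and the PY_TYPES mapping are hypothetical names, only keyword arguments with literal values are handled, and the set of stripped characters follows the description above.

```python
import ast
import re

# Assumed mapping from schema type names to Python types.
PY_TYPES = {"integer": int, "float": float, "string": str, "boolean": bool}


def parse_call(source):
    """Parse 'f(a=1, b="x")' into (function_name, {param: value})."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("model output is not a function call")
    name = ast.unparse(node.func)  # also handles dotted names such as spotify.play
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def type_ok(value, expected):
    """Strict type check, with the int -> float relaxation described above."""
    if expected == "float" and type(value) is int:
        return True
    return type(value) is PY_TYPES[expected]


def standardize(text):
    """Drop whitespace and the punctuation subset ,./-_*^ before comparing strings."""
    return re.sub(r"[\s,./\-_*^]", "", text)


name, kwargs = parse_call('calculate_triangle_area(base=10, height=5, unit="cm")')
print(name, kwargs)
print(type_ok(10, "float"), type_ok("10", "integer"))
print(standardize("New-York, NY"))
```

Parsing first and comparing afterwards means that differences in argument order, spacing, or quoting style in the model's output do not affect the verdict.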

Parallel categories use no-order matching (calls may appear in any order, but the count must match); multiple uses all-or-nothing ("strict all-or-nothing for multiple/parallel scenarios", blog 8 verbatim).
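Order-insensitive matching can be sketched as a greedy pairing between the model's calls and the expected calls. The helper names and the tuple representation are assumptions for illustration, and the expected values for the Spotify example are read off the question in §3.2.1 rather than copied from the repo's ground-truth file.

```python
def call_matches(call, expected):
    """One call matches if the name, parameter names, and allowed values all agree."""
    name, args = call
    expected_name, allowed = expected
    return (
        name == expected_name
        and args.keys() == allowed.keys()
        and all(args[key] in allowed[key] for key in args)
    )


def parallel_match_no_order(model_calls, expected_calls):
    """All-or-nothing: same number of calls, each expected call used exactly once."""
    if len(model_calls) != len(expected_calls):
        return False
    remaining = list(expected_calls)
    for call in model_calls:
        for index, expected in enumerate(remaining):
            if call_matches(call, expected):
                del remaining[index]
                break
        else:
            return False  # this call matched none of the remaining expected calls
    return True


expected = [
    ("spotify.play", {"artist": ["Taylor Swift"], "duration": [20]}),
    ("spotify.play", {"artist": ["Maroon 5"], "duration": [15]}),
]
reordered = [
    ("spotify.play", {"artist": "Maroon 5", "duration": 15}),
    ("spotify.play", {"artist": "Taylor Swift", "duration": 20}),
]
list_arg = [("spotify.play", {"artist": ["Taylor Swift", "Maroon 5"], "duration": [20, 15]})]
print(parallel_match_no_order(reordered, expected), parallel_match_no_order(list_arg, expected))
```

Greedy pairing is enough when each call can match only one expected call; if the allowed-value sets overlapped, a proper bipartite matching would be the safer choice.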

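4.2 Executable evaluation (introduced in V1, retired wholesale on 2025-04-09 in #943)

The original exec_simple / exec_parallel / exec_multiple / exec_parallel_multiple / rest categories actually imported a function package and executed it, then applied one of four criteria:

exact_match: strict equality
structural_match (type match): lists are compared by length, dicts by key
real_time_match: numeric results may deviate by 20% (to tolerate small changes in live API data)
API sanity check: before each eval run, every API is pinged to confirm it is healthy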
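Two of these criteria are easy to state as code. The sketch below is an interpretation of the one-line descriptions above, not the retired BFCL implementation: the function names mirror the criterion names, and measuring the 20% tolerance relative to the expected value is an assumption.

```python
def structural_match(actual, expected):
    """Same container type; lists agree on length, dicts agree on keys."""
    if type(actual) is not type(expected):
        return False
    if isinstance(expected, list):
        return len(actual) == len(expected)
    if isinstance(expected, dict):
        return actual.keys() == expected.keys()
    return True


def real_time_match(actual, expected, tolerance=0.20):
    """Numeric result within a relative tolerance of the expected value (assumed relative)."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) <= tolerance * abs(expected)


print(structural_match([1, 2, 3], [7, 8, 9]), structural_match({"temp": 20}, {"humidity": 5}))
print(real_time_match(110, 100), real_time_match(121, 100))
```

Both checks are deliberately loose, which is what made them usable against live APIs whose values change between runs.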

Note: V4 no longer includes the executable categories. CHANGELOG 2025-04-09 #943 verbatim: "Retire the executable categories from the leaderboard. The following categories will be excluded from the evaluation pipeline: rest, exec_simple, exec_parallel, exec_multiple, exec_parallel_multiple." The core reason for the retirement is that real APIs are too unstable (weather/news/finance APIs break almost every week), whereas AST + state-based evaluation is easier to control.

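4.3 State-based evaluation (since V3, used for all multi-turn categories)

This is V3's biggest methodological innovation. The multi-turn suite ships with backend simulators (GorillaFileSystem / TwitterAPI / TravelAPI / MessageAPI / VehicleControlAPI / TicketAPI / TradingBot / MathAPI). Every tool call the model issues is actually executed by the backend and changes its state. At evaluation time (blog 13 verbatim):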

"State-based evaluation: Compares backend system attributes after all function calls complete each turn, capturing correctness of write/delete operations.

Response-based evaluation (subset-matched): Validates execution paths against minimal viable ground truth trajectories, ensuring read-only requests succeed while allowing reasonable alternative paths rather than requiring exact trajectory matching.

Human-labeled ground truth trajectories underpin all evaluation."

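Key design points:

State diff rather than trajectory match: as long as the final state is right, any route there is acceptable. This permits exploration and error recovery.
Read-only operations get a separate subset match: read-only calls do not modify state, so the check is only whether the right calls were made.
Human-labeled GT: no LLM judge is used, hence the results are reproducible. This is the dividing line between BFCL and MCP-Atlas / MCP-Universe (which rely on either a Gemini judge or a real GCP environment, neither of which is cheap).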
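The state-diff idea can be shown with a toy backend: replay the model's calls and the ground-truth calls on two fresh copies of the same initial state, then compare attributes. ToyFileSystem, final_state, and state_matches are hypothetical names, and this sketch compares only once at the end, whereas the description above compares after every turn.

```python
class ToyFileSystem:
    """Tiny in-memory backend; a hypothetical stand-in for a BFCL simulator class."""

    def __init__(self, files):
        self.files = dict(files)

    def mv(self, src, dst):
        # Write operation: changes backend state.
        self.files[dst] = self.files.pop(src)

    def cat(self, path):
        # Read-only operation: leaves state untouched.
        return self.files[path]


def final_state(initial_files, calls):
    """Replay (method_name, kwargs) calls on a fresh backend and return its attributes."""
    backend = ToyFileSystem(initial_files)
    for method, kwargs in calls:
        getattr(backend, method)(**kwargs)
    return vars(backend)


def state_matches(initial_files, model_calls, ground_truth_calls):
    return final_state(initial_files, model_calls) == final_state(initial_files, ground_truth_calls)


initial = {"document/final_report.pdf": "budget analysis"}
move = ("mv", {"src": "document/final_report.pdf", "dst": "document/temp/final_report.pdf"})
ground_truth = [move]
exploratory = [("cat", {"path": "document/final_report.pdf"}), move]  # extra read, same end state
wrong_place = [("mv", {"src": "document/final_report.pdf", "dst": "archive/final_report.pdf"})]

print(state_matches(initial, exploratory, ground_truth), state_matches(initial, wrong_place, ground_truth))
```

The exploratory trajectory passes even though it contains an extra call, which is exactly the freedom that a strict trajectory match would take away.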

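4.4 Hallucination detection (present in every version)

The irrelevance / live_irrelevance / multi_turn_miss_func categories all test refusal: when the function docs contain no suitable tool, the model should answer in natural language instead of making up a function name. Hallucination here means inventing a call when no suitable tool exists.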

§5 Open-source status: repo / dataset / evaluation scripts / self-hosting

Component	Location	Status
Code repository	ShishirPatil/gorilla / berkeley-function-call-leaderboard/	Main directory, Apache-2.0
PyPI package	`pip install bfcl-eval`	Published 2025-06-08 (#1054). Note: PyPI also hosts an unrelated `bfcl` package; do not install it by mistake
HF dataset	gorilla-llm/Berkeley-Function-Calling-Leaderboard	11.7 MB, 49,556 downloads/month, Apache-2.0. `load_dataset()` does not work; the JSONL files must be read manually
Live leaderboard	gorilla.cs.berkeley.edu/leaderboard.html	Updated monthly; last commit 2026-04-12 (f7cf735)
Evaluation scripts	`bfcl_eval/eval_checker/`	`ast_eval/ast_checker.py` + `multi_turn_eval/` + `executable_eval/` (legacy)
Model handlers	`bfcl_eval/model_handler/`	base_handler.py plus one handler file per model
License	Apache-2.0 (end of README: "All the leaderboard statistics, and data used to train the models are released under Apache 2.0")	Training data may be used commercially
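Since the dataset files have to be read by hand, a small JSON Lines reader is usually all that is needed. The sketch below assumes one JSON object per line, as in the memory samples quoted in §3.2.7; read_jsonl is a hypothetical helper, and the demo writes a throwaway file instead of touching the real dataset.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a file with one JSON object per line into a list of dicts."""
    entries = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


# Demo on a temporary file; with the real dataset, pass the path of a downloaded file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.json"
    sample.write_text(
        '{"id": "memory_0-customer-0", "scenario": "customer"}\n'
        '{"id": "memory_1-customer-1", "scenario": "customer"}\n',
        encoding="utf-8",
    )
    entries = read_jsonl(sample)

print([entry["id"] for entry in entries])
```

Calling json.load on the whole file would fail on this layout, because the file as a whole is not a single JSON document.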

5.1 Complete three-step self-hosting flow (README verbatim)

# 1. Set up the environment
conda create -n BFCL python=3.10
conda activate BFCL
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .

# 2. Generate model responses (using the OSS model Qwen3-8B as an example)
bfcl generate \
  --model Qwen/Qwen3-8B-FC \
  --test-category multi_turn \
  --backend vllm \
  --num-gpus 1 \
  --gpu-memory-utilization 0.9

# 3. Evaluate
bfcl evaluate --model Qwen/Qwen3-8B-FC --test-category multi_turn

# Output: score/Qwen/Qwen3-8B-FC/multi_turn/*.json
#       score/data_overall.csv
#       score/data_multi_turn.csv

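Key flags:

--backend {sglang|vllm}: sglang is "much faster" on multi-turn cases but only supports SM80+ (Ampere onward); use vllm on older GPUs
--local-model-path: offline inference; points to a pre-downloaded checkpoint directory
--enable-lora --lora-modules name="path": evaluates a LoRA adapter fine-tune directly
--run-ids + test_case_ids_to_generate.json: runs only the specified IDs, saving time while iterating
--partial-eval: evaluates a subset of entries (the resulting numbers are not directly comparable with the official leaderboard)
--skip-server-setup + LOCAL_SERVER_ENDPOINT/LOCAL_SERVER_PORT: connects to a vllm server that is already running on SLURM
REMOTE_OPENAI_BASE_URL / REMOTE_OPENAI_API_KEY: connects to RunPod / ngrok / an enterprise gateway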

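§6 Full leaderboard: V4, complete (quoted 2026-05)

Data source (upgraded: pulled directly from the official CSV): the previous edition used a third-party mirror with 1-3 pp of drift. This edition runs curl directly on https://gorilla.cs.berkeley.edu/data_overall.csv (plus data_live.csv / data_non_live.csv / data_multi_turn.csv / data_agentic.csv / data_format_sensitivity.csv), obtaining the same CSV that the official leaderboard's JS reads when rendering (JS code verbatim: const csvFilePath = `./data_${datasetName}.csv`). As of the 2026-05-13 fetch there are 109 model rows, the most authoritative V4 data currently available. The full set is shown below in groups.

6.1 Reading the header: what the V4 columns mean

The official CSV has 36 columns; this article focuses on the 8 semantic dimensions below (others, such as cost / latency, are used in the §7 submission workflow):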

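Column	Meaning	Weight (new V4 formula)
`Overall Acc`	Official overall score (weighted average of the 5 segments below: Non-Live, Live, Multi Turn, Agentic, Irrelevance)	-
`Non-Live AST`	AST comparison against the ground-truth call (the core of BFCL since V1)	10%
`Live`	Real-world prompts contributed by users (introduced in V2)	10%
`Multi Turn`	4 sub-categories (base / miss_func / miss_param / long_context)	30%
`Web Search`	New in V4: two web-search variants, base + no_snippet	(part of the 40% agentic share)
`Memory`	New in V4: three backends, KV / Vector / Recursive Summarization	(part of the 40% agentic share)
`Irrelevance Detection`	The ability to decline when no function call is appropriate	10%
`Format Sensitivity Max Delta`	Largest score drop across 5 system-prompt variants (diagnostic only, not scored)	-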
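The weights in the table can be written out as a small function. This sketch assumes the 40% agentic share is the unweighted mean of Web Search and Memory; the table does not state that split, but the assumption reproduces the Overall values of the rows listed later in this section to within rounding. Function and parameter names are illustrative.

```python
# V4 weights from the table above; "agentic" covers Web Search + Memory.
WEIGHTS = {
    "non_live": 0.10,
    "live": 0.10,
    "irrelevance": 0.10,
    "multi_turn": 0.30,
    "agentic": 0.40,
}


def overall_acc(non_live, live, multi_turn, web_search, memory, irrelevance):
    """Weighted V4 overall score on a 0-100 scale."""
    parts = {
        "non_live": non_live,
        "live": live,
        "irrelevance": irrelevance,
        "multi_turn": multi_turn,
        # Assumption: the agentic segment is the plain mean of its two parts.
        "agentic": (web_search + memory) / 2,
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


# Claude-Opus-4-5-20251101 (FC) row from section 6.2; the CSV lists 77.47.
claude = overall_acc(88.58, 79.79, 68.38, 84.50, 73.76, 84.72)
# A model that declines everything scores only on Irrelevance.
decline_all = overall_acc(0.0, 0.0, 0.0, 0.0, 0.0, 100.0)
print(round(claude, 2), round(decline_all, 2))
```

The second call shows why a model that scores 100 on Irrelevance and 0 everywhere else still lands at 10 overall.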

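6.2 (a) Proprietary frontier models: all of V4

Taken from the 2026-05-13 snapshot of data_overall.csv: all proprietary models from Anthropic / OpenAI / Google / xAI / Cohere / Mistral / Amazon / Writer (values shown to two decimals; FormatΔ N/A means the official run did not include this diagnostic). Rank is the position on the full 109-row board, so the numbering skips wherever an open-weight model sits: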

Rank	Model	Org	Overall	NonLive	Live	MultiTurn	WebSrch	Memory	Irrel	FmtΔ
1	Claude-Opus-4-5-20251101 (FC)	Anthropic	77.47	88.58	79.79	68.38	84.50	73.76	84.72	N/A
2	Claude-Sonnet-4-5-20250929 (FC)	Anthropic	73.24	88.65	81.13	61.37	81.00	64.95	86.61	N/A
3	Gemini-3-Pro-Preview (Prompt)	Google	72.51	90.65	83.12	60.75	80.00	61.72	85.59	8.5
5	Grok-4-1-fast-reasoning (FC)	xAI	69.57	88.27	78.46	58.87	82.50	53.98	79.43	N/A
6	Claude-Haiku-4-5-20251001 (FC)	Anthropic	68.70	86.50	78.68	53.62	83.50	54.41	85.11	N/A
7	Gemini-3-Pro-Preview (FC)	Google	68.14	85.75	81.72	63.12	68.50	54.84	77.85	N/A
8	o3-2025-04-16 (Prompt)	OpenAI	63.05	81.94	73.21	62.25	50.50	51.83	83.98	8.5
9	Grok-4-0709 (Prompt)	xAI	62.97	82.75	72.54	47.00	74.00	50.54	84.30	13.0
10	Grok-4-0709 (FC)	xAI	61.38	85.38	75.57	33.88	82.00	55.91	75.40	N/A
12	Grok-4-1-fast-non-reasoning (FC)	xAI	58.29	88.13	77.94	46.75	75.00	26.24	74.09	N/A
13	Command A Reasoning (FC)	Cohere	57.06	86.27	78.61	50.12	55.50	28.82	86.75	N/A
15	Gemini-2.5-Flash (FC)	Google	56.24	84.96	74.39	36.25	59.00	41.29	93.67	N/A
16	GPT-5.2-2025-12-11 (FC)	OpenAI	55.87	81.85	70.39	28.12	75.50	45.81	79.42	N/A
17	GPT-5-mini-2025-08-07 (FC)	OpenAI	55.46	69.85	58.62	27.50	82.00	44.30	91.01	N/A
20	GPT-4.1-2025-04-14 (FC)	OpenAI	53.96	82.79	69.95	38.88	68.00	23.87	86.52	N/A
21	o4-mini-2025-04-16 (FC)	OpenAI	53.24	37.73	66.10	41.75	75.50	34.19	83.91	N/A
24	GPT-5-nano-2025-08-07 (FC)	OpenAI	51.45	68.00	59.44	34.50	72.50	24.73	89.10	N/A
26	Gemini-2.5-Flash (Prompt)	Google	50.90	88.08	78.16	16.75	62.00	38.71	91.09	9.0
27	GPT-4.1-mini-2025-04-14 (FC)	OpenAI	50.45	83.83	68.84	34.13	57.00	26.88	81.69	N/A
28	o4-mini-2025-04-16 (Prompt)	OpenAI	50.26	81.29	70.76	16.62	71.50	35.27	87.16	9.5
30	o3-2025-04-16 (FC)	OpenAI	48.56	40.38	66.17	14.75	77.00	47.31	86.13	N/A
35	Command A (FC)	Cohere	46.49	87.56	78.53	29.50	46.50	16.56	84.19	N/A
38	GPT-5.2-2025-12-11 (Prompt)	OpenAI	45.27	78.29	67.14	43.75	40.50	3.87	87.26	13.0
46	mistral-large-2411 (FC)	Mistral AI	38.37	84.65	81.87	14.12	28.00	24.95	68.92	N/A
49	Mistral-Medium-2505 (FC)	Mistral AI	37.56	67.44	67.95	10.75	35.00	23.01	91.95	N/A
52	Gemini-2.5-Flash-Lite (FC)	Google	36.87	86.60	65.80	13.50	21.00	20.65	92.50	N/A
57	Claude-Opus-4-5-20251101 (Prompt)	Anthropic	33.47	89.65	76.02	16.12	13.00	1.94	90.75	13.0
61	Command R7B (FC)	Cohere	32.07	80.96	69.06	8.25	27.00	5.16	81.65	N/A
76	palmyra-x-004 (FC)	Writer	27.87	87.46	77.87	0.38	2.50	13.12	80.99	N/A
80	Amazon-Nova-2-Lite-v1:0 (FC)	Amazon	27.10	86.96	80.83	2.12	5.00	2.37	82.11	N/A
88	Amazon-Nova-Pro-v1:0 (FC)	Amazon	24.97	86.58	78.53	1.88	2.50	1.94	70.06	N/A

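Key observations on the frontier group:

Anthropic's FC handler flips the result: Claude-Opus-4-5 (FC) = 77.47 takes first place on V4, versus Claude-Opus-4-5 (Prompt) = 33.47. Same weights, and prompt mode loses 44 pp. This is an extreme example of the format fragility that the FormatΔ column (§6.1) is meant to expose. It also confirms, on the official CSV, the previous version's "Claude V3 25.3 → V4 ~70" analysis.
GPT-5.2 (FC) degrades sharply on multi-turn: its MultiTurn score of 28.12 is 40 pp below Claude Opus 4.5 (FC) at 68.38, and below its own Prompt row at 43.75. This is the most counterintuitive result of the past 6 months: the industry expected reasoning models to be steadier on multi-turn, and the data shows the opposite.
o3 (Prompt) 63.05 vs o3 (FC) 48.56: OpenAI's own FC handler costs o3 about 14 pp, the opposite of the Anthropic FC gain. The trade-offs in the two vendors' FC API designs are visible in public data.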
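The FC-versus-Prompt gaps quoted above are simple differences of the Overall column. The sketch below recomputes them for the models that appear in both modes in the table; the dictionary layout and function name are illustrative.

```python
# Overall Acc for models listed in both modes in the 6.2 table.
OVERALL = {
    "Claude-Opus-4-5-20251101": {"FC": 77.47, "Prompt": 33.47},
    "Gemini-3-Pro-Preview": {"FC": 68.14, "Prompt": 72.51},
    "o3-2025-04-16": {"FC": 48.56, "Prompt": 63.05},
    "Grok-4-0709": {"FC": 61.38, "Prompt": 62.97},
    "GPT-5.2-2025-12-11": {"FC": 55.87, "Prompt": 45.27},
}


def mode_gaps(scores):
    """FC minus Prompt, in percentage points, sorted from largest gain to largest loss."""
    gaps = {model: round(s["FC"] - s["Prompt"], 2) for model, s in scores.items()}
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))


gaps = mode_gaps(OVERALL)
for model, gap in gaps.items():
    print(f"{model:28s} {gap:+7.2f} pp")
```

A positive gap means the native function-calling handler helps, a negative one means prompt mode scores higher. The sign differs by vendor, which is why the mode has to be reported alongside any BFCL number.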

6.3 (b) Open-weight ≥40B models

Rank	Model	Org	License	Overall	NonLive	Live	MultiTurn	WebSrch	Memory	Irrel
4	GLM-4.6 (FC thinking)	Zhipu AI	MIT	72.38	87.56	80.90	68.00	77.50	55.70	84.96
11	Kimi-K2-Instruct (FC)	MoonshotAI	modified-MIT	59.06	81.60	78.68	50.63	66.50	29.03	87.34
14	DeepSeek-V3.2-Exp (Prompt+Thinking)	DeepSeek	MIT	56.73	85.52	76.02	44.88	58.00	44.09	67.00
19	DeepSeek-V3.2-Exp (FC)	DeepSeek	MIT	54.12	34.85	53.66	37.38	69.50	54.19	93.18
23	Qwen3-235B-A22B-Instruct-2507 (Prompt)	Qwen	Apache-2.0	52.15	90.33	78.68	44.62	50.50	19.35	78.89
31	Qwen3-235B-A22B-Instruct-2507 (FC)	Qwen	Apache-2.0	47.99	37.40	68.91	45.38	54.00	23.87	81.73
50	Llama-4-Maverick-17B-128E-FP8 (FC)	Meta	Llama 4 Community	37.29	88.65	73.65	20.25	28.00	18.92	55.97
62	Llama-3.3-70B-Instruct (FC)	Meta	Llama 3 Community	31.90	88.02	76.61	21.50	10.00	8.17	53.53
72	Llama-4-Scout-17B-16E (FC)	Meta	Llama 4 Community	28.13	89.38	74.69	9.00	14.50	8.17	44.92
74	CoALM-70B	UIUC + Oumi	Llama 3 Community	27.99	83.44	67.28	10.62	0.00	5.81	85.65
108	Llama-3.1-Nemotron-Ultra-253B-v1 (FC)	NVIDIA	NVIDIA Open	10.00	0.00	0.00	0.00	0.00	0.00	100.00

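Note that GLM-4.6 (MIT) is currently first among open-weight models on V4. At 72.38% it sits about 5 pp behind Claude Opus 4.5 (77.47%) and is essentially tied with Gemini-3-Pro-Preview (Prompt) at 72.51. An open-weight model from Z AI this close to the top of the board is one of the biggest open-weight stories of 2026 H1. Kimi-K2-Instruct (modified-MIT) is second at 59.06. Also note that Llama-3.1-Nemotron-Ultra-253B-v1 (FC) scores 10% on the official CSV, with 0.00 on NonLive / Live / MultiTurn. This is not a data error: the handler is incompatible or every output is rejected, and Irrelevance at 100% is the degenerate "decline everything" solution. With Irrelevance weighted at 10%, that alone produces the 10.00 overall (see §6.5 for related failure modes).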

6.4 (c) Open-weight ≤40B models: the sub-board that matters most when training ≤40B

Rank	Model	Params	License	Overall	NonLive	Live	MultiTurn	WebSrch	Memory	Irrel
25	Nanbeige4-3B-Thinking-2511 (FC)	3B	Apache-2.0	51.40	81.58	79.42	51.12	21.50	36.77	83.09
29	Qwen3-32B (FC)	32B dense	Apache-2.0	48.71	88.77	82.01	47.87	21.50	26.67	76.37
32	Nanbeige3.5-Pro-Thinking (FC)	—	Apache-2.0	47.68	38.35	69.95	40.00	42.00	45.16	74.20
33	Qwen3-32B (Prompt)	32B dense	Apache-2.0	46.78	90.27	82.01	43.25	26.00	15.70	82.39
36	BitAgent-Bounty-8B	8B	Apache-2.0	46.23	81.60	93.12	62.38	0.00	1.51	97.48
39	Qwen3-8B (FC)	8B	Apache-2.0	42.57	87.58	80.53	41.75	12.00	14.62	79.07
40	ToolACE-2-8B (FC)	8B	Apache-2.0	42.44	87.10	77.42	38.38	8.50	18.49	90.79
41	Qwen3-30B-A3B-Instruct-2507 (FC)	30B/A3B	Apache-2.0	41.39	85.77	77.94	30.00	22.50	17.63	79.90
43	Qwen3-14B (FC)	14B	Apache-2.0	41.03	84.94	80.01	34.75	10.00	19.57	81.94
54	Qwen3-4B-Instruct-2507 (FC)	4B	Apache-2.0	35.68	87.88	76.39	22.12	3.00	17.63	84.93
66	Gemma-3-12b-it (Prompt)	12B	Gemma TOU	30.43	79.44	74.24	5.75	4.00	27.53	70.29
69	Gemma-3-27b-it (Prompt)	27B	Gemma TOU	29.47	87.17	74.54	10.75	0.00	13.55	73.67
70	Phi-4 (Prompt)	14B	MIT	28.79	69.56	60.70	3.88	4.50	24.73	87.55
81	Granite-3.1-8B-Instruct (FC)	8B	Apache-2.0	27.10	78.33	60.33	7.50	0.50	14.41	79.98
82	Falcon3-10B-Instruct (FC)	10B	Falcon-LLM	27.01	85.00	75.43	6.50	1.50	27.53	32.09
83	Granite-3.2-8B-Instruct (FC)	8B	Apache-2.0	26.87	79.77	60.33	7.38	0.50	12.47	80.53
85	Llama-3.1-8B-Instruct (Prompt)	8B	Llama 3 Community	25.83	84.00	70.76	11.12	3.00	10.75	42.70
86	MiniCPM3-4B-FC (FC)	4B	Apache-2.0	25.55	81.75	65.21	3.88	0.00	12.04	72.84
93	Granite-20b-FunctionCalling (FC)	20B	Apache-2.0	23.23	82.35	58.70	5.38	0.00	0.00	75.13
103	Granite-4.0-350m (FC)	0.35B	Apache-2.0	18.98	67.92	46.11	2.50	0.50	3.23	60.84

Top 3 ≤40B open-weight on V4 (verbatim from official CSV):

Nanbeige4-3B-Thinking-2511 (FC): 51.40% · 3B Apache-2.0 · the biggest surprise of 2026-05: a 3B model reaches 51.40% overall, and its multi-turn score of 51.12% beats Qwen3-32B (FC) at 47.87%.
Qwen3-32B (FC) — 48.71% · 32B Apache-2.0
Nanbeige3.5-Pro-Thinking (FC) — 47.68% · Apache-2.0

6.5 (d) Specialized fine-tuned tool models

Rank	Model	Params	License	Overall	NonLive	Live	MultiTurn	WebSrch	Memory	Irrel
18	xLAM-2-32b-fc-r (FC)	32B	cc-by-nc-4.0	54.66	89.60	75.50	69.50	25.50	20.86	80.23
22	Llama-xLAM-2-70b-fc-r (FC)	70B	cc-by-nc-4.0	53.07	88.44	72.17	77.38	15.00	14.41	79.11
34	Llama-xLAM-2-8b-fc-r (FC)	8B	cc-by-nc-4.0	46.68	84.58	67.95	70.00	6.50	13.98	63.28
37	Arch-Agent-32B	32B	katanemo-research	45.37	88.92	80.68	54.25	5.00	14.62	82.15
42	xLAM-2-3b-fc-r (FC)	3B	cc-by-nc-4.0	41.22	82.96	62.92	58.38	2.50	11.40	63.45
56	Arch-Agent-3B	3B	katanemo-research	35.36	86.67	72.91	34.88	0.50	6.88	74.67
60	Arch-Agent-1.5B	1.5B	katanemo-research	32.14	82.67	67.73	26.62	0.00	8.17	74.83
64	Hammer2.1-7b (FC)	7B	cc-by-nc-4.0	31.67	85.50	69.50	23.87	0.00	0.00	90.12
65	xLAM-2-1b-fc-r (FC)	1B	cc-by-nc-4.0	30.44	69.04	55.14	36.00	0.00	3.87	64.47
68	Hammer2.1-3b (FC)	3B	qwen-research	29.71	84.96	70.54	16.50	0.00	3.01	86.12
75	Hammer2.1-1.5b (FC)	1.5B	cc-by-nc-4.0	27.88	82.98	69.50	15.62	0.00	0.00	79.40
84	CoALM-8B	8B	Llama 3 Community	26.81	84.87	66.77	8.00	0.00	2.80	86.90
100	Hammer2.1-0.5b (FC)	0.5B	cc-by-nc-4.0	21.22	65.98	54.63	2.88	0.00	1.08	80.79

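The failure modes of specialized models can be read straight off the CSV:

Llama-xLAM-2-70b-fc-r has the highest multi-turn score on the whole board, 77.38% (above Claude Opus 4.5's 68.38%), but only 15.00 on WebSearch and 14.41 on Memory: the classic "multi-turn specialist that never saw agentic tasks". This is the RL/SFT bias discussed in §11: the APIGen-MT dataset is synthetic multi-turn data with almost no web/memory content.
The Hammer2.1 series scores 0.00 on WebSearch at all four sizes and between 0.00 and 3.01 on Memory. This is not a bug: the Hammer paper (ICLR 2025 Spotlight) explicitly targets on-device use with function masking, its dataset contains no agentic tasks at all, and V4's 40% agentic weight is the revision that hurts it most.
BitAgent-Bounty-8B has extremely high Live 93.12% / Irrel 97.48% but WebSearch 0.00 / Memory 1.51: another extreme case of a model tuned for the non-agentic categories (its MultiTurn is 62.38).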
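The patterns described above can be flagged mechanically from the per-category columns. This is a rough sketch: the two thresholds (`agentic_floor`, `schema_floor`) and the label strings are arbitrary illustrative choices, not part of BFCL.

```python
# (model, non_live, live, multi_turn, web_search, memory, irrelevance),
# copied from the tables in this section.
ROWS = [
    ("Llama-xLAM-2-70b-fc-r (FC)", 88.44, 72.17, 77.38, 15.00, 14.41, 79.11),
    ("Hammer2.1-7b (FC)", 85.50, 69.50, 23.87, 0.00, 0.00, 90.12),
    ("BitAgent-Bounty-8B", 81.60, 93.12, 62.38, 0.00, 1.51, 97.48),
    ("Llama-3.1-Nemotron-Ultra-253B-v1 (FC)", 0.00, 0.00, 0.00, 0.00, 0.00, 100.00),
    ("Claude-Opus-4-5-20251101 (FC)", 88.58, 79.79, 68.38, 84.50, 73.76, 84.72),
]


def diagnose(non_live, live, multi_turn, web_search, memory, irrelevance,
             agentic_floor=20.0, schema_floor=60.0):
    """Label a row with a coarse failure mode (thresholds are hypothetical)."""
    agentic = (web_search + memory) / 2
    if irrelevance >= 99.0 and max(non_live, live, multi_turn) == 0.0:
        return "declines everything"
    if min(non_live, live) >= schema_floor and agentic < agentic_floor:
        return "no agentic coverage"
    return "balanced"


labels = {row[0]: diagnose(*row[1:]) for row in ROWS}
for model, label in labels.items():
    print(f"{model:40s} {label}")
```

A check of this kind is most useful as a pre-submission sanity test: a "declines everything" label almost always points at a handler or parsing problem rather than at the model.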

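6.6 ≤40B open-weight sub-board compared across versions (carried over from the previous version)

For readers of #27 (the small-model MCP landscape), joining the V4 data above to the V3 / self-reported numbers we already had gives this cross-version sub-board:

Model	params	BFCL V3 (old)	BFCL V4 (official CSV)	License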
Nanbeige4-3B-Thinking-2511	3B	—	51.40 ✨	Apache-2.0
Qwen3-32B (FC)	32B dense	75.7 (V3 mirror)	48.71	Apache-2.0
Qwen3-30B-A3B-Instruct-2507 (FC)	30B / 3B-active MoE	~70	41.39	Apache-2.0
TOUCAN-Qwen2.5-32B (#22)	32B dense	70.45	(no standalone V4 entry; dataset only)	Apache-2.0 (data+ckpt)
Qwen3-8B + EnvScaler (#23)	8B	—	BFCL-MT 41.88	MIT
AWM-Qwen3-Thinking-8B (#18)	8B	53.83 → 65.94 OOD	—	—
Salesforce/xLAM-2-32b-fc-r	32B	~62	54.66	cc-by-nc-4.0
ToolACE-2-8B	8B	~55	42.44	Apache-2.0
BitAgent-Bounty-8B	8B	~54	46.23	Apache-2.0
Phi-4	14B	40.8	28.79	MIT

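V4 numbers are broadly lower than V3, by about 7 to 29 pp for the models in this table that have both. This is a direct result of V4 giving multi-turn + agentic a combined 70% weight (33% in V3). Note that Nanbeige4-3B takes first place among ≤40B models on V4 as an upset, and Qwen3-32B is no longer the top of the ≤40B group: the biggest reshuffle caused by the V4 formula change.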

§7 Submission flow

Unlike MCPMark / Atlas, BFCL has no web submission portal; submission is purely PR-based. README "Contributing", verbatim:

"We welcome contributions! To add a new model:

Review bfcl_eval/model_handler/base_handler.py and/or bfcl_eval/model_handler/local_inference/base_oss_handler.py (if your model is hosted locally).

Implement a new handler class for your model.

Update bfcl_eval/constants/model_config.py.

Submit a Pull Request.

For detailed steps, please see the Contributing Guide."

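Steps in practice:

Fork ShishirPatil/gorilla
Implement a handler class under bfcl_eval/model_handler/{api_inference,local_inference}/ (use OpenAIHandler / AnthropicHandler / QwenHandler as references)
Register the model name + handler class in bfcl_eval/constants/model_config.py
Run bfcl generate + bfcl evaluate locally to produce score/data_overall.csv
Open a PR, attaching data_overall.csv + data_live.csv + data_multi_turn.csv
The Berkeley team reviews, merges, and updates the live leaderboard

Note: the numbers attached to the PR are not taken on trust. Maintainers re-run the model at the same commit (currently f7cf735, 2026-04-12) on Berkeley's own hardware. This is stricter than MCPMark (#21), which accepts self-reported results and merely labels them "self-reported".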

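§8 Limitations, and the essential difference from MCP benchmarks

8.1 Limitations of BFCL itself

Schema-based, not stateful real-MCP: BFCL multi-turn uses 7 simplified Python-class backends (GorillaFileSystem / TravelAPI / MessageAPI / VehicleControlAPI / TicketAPI / TradingBot / MathAPI), not real OAuth-authenticated MCP servers. A high BFCL score therefore does not guarantee that a model works against the real Notion / GitHub / Stripe APIs.
Web Search is tied to SerpAPI: to run the V4 web_search category you need a SerpAPI key, and scores drift as web content changes. This is the only non-deterministic part of V4 evaluation.
The Live category depends on the quality of user contributions: V2 Live data comes from community PRs and is uneven in quality (the CHANGELOG repeatedly fixes individual ground-truth bugs: #600 #1185 #1175 #1086 …).
Format Sensitivity exposes how fragile scores are: V4 introduced this category to acknowledge that whether a model scores 60% or 70% on BFCL can depend entirely on how the system prompt is written.
No truly long horizon: V3 multi-turn is mostly 1-3 turns, and V4 web/memory tasks rarely exceed 10 turns. Compared with Toolathlon's (#21) multi-hop tasks over 50+ tools, BFCL agentic tasks read more like short vignettes.
Limited language/ecosystem coverage: Java 100 + JavaScript 50 is too few, and REST 70 was retired in V4. In practice BFCL is close to a Python-only function-calling benchmark.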

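8.2 Why "high BFCL score ≠ high MCP-benchmark score"

This is a claim that #21 and #26 verified repeatedly. The core differences, distilled into one table:

Dimension	BFCL	MCP-Atlas / Universe / Toolathlon / MCPMark
Tool interface	Python class API (pure schema)	Real MCP protocol (JSON-RPC + OAuth)
State	Simple in-memory backend	Real stateful services (GCP / Notion / Stripe / Slack / Filesystem)
Evaluator	AST + state diff (no LLM judge)	Atlas: Gemini-2.5-Pro claim judge · Universe: three layers (format + static + dynamic) · MCPMark: real assertions in verify.py · Toolathlon: real side effects
Turns	1-10 turns	10-50+ turns
Failure modes	Wrong call / wrong arguments / failing to decline	Expired OAuth token / API rate limits / half-completed state / leaked side effects
Cost (one full eval run)	~$50 API + a few GPU hours	$500-2000 API + a real cloud env
What a high score means	Holds to the schema + keeps track of state	Can complete tasks on a real production stack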

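So GPT-5 scores only 59.2 on BFCL V4 while the GPT family's MCP-Atlas numbers sit elsewhere (GPT-5.4: 68.1% on MCP-Atlas), and Claude Opus 4 scored 25.3 on BFCL V3 while Claude Opus 4.7 reaches 77.3 on MCP-Atlas Live. These pairs mix model versions, so they are illustrative rather than controlled comparisons, but they point the same way: the two families of benchmarks measure different dimensions. As #26 concluded:

"You cannot make one checkpoint SOTA on every benchmark, but you can reach about 80% on each. Joint training recipe: TOUCAN cold start (BFCL) + MCPMark/Universe RLVR (strict schema verification) + Toolathlon long-horizon SFT + Atlas LoRA isolation."
(#26 conclusion, translated)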

§9 How frontier models cite BFCL officially

This is the test that #19 and #21 established: whether a benchmark is a de-facto standard shows in whether frontier model cards cite it directly.

Frontier model	Cites BFCL in the official card?	How it is cited
Claude Opus 4.7 (Anthropic, 2026-04-16)	Not cited directly in the official card	Anthropic leads with MCP-Atlas (77.3% Live); the BFCL handler is community-maintained (CHANGELOG #1019 adds `claude-opus-4-1`)
GPT-5 (OpenAI, 2025-08-07)	Not in the card; built-in V4 handler only	OpenAI makes no standalone BFCL claim, but GPT-5 / 5-mini / 5-nano were added explicitly as "New model support" in V4 #1019
GPT-5.5 (OpenAI, early 2026)	Yes: cited in the system card	The GPT-5.5 system card lists BFCL among the source benchmarks of CoT-Control (over 13,000 tasks; full quote in §11.6). BFCL is a source of tasks there, not a reported eval score
Gemini 3.1 Pro (Google DeepMind, 2025-12)	Not cited explicitly in the model card	The official card stresses "meaningful improved tool use" but gives no BFCL number. A third party (awesomeagents.ai) reports Gemini 3 Flash Preview Thinking at 53.5% on V3
Kimi K2.5 (Moonshot)	Yes: the official paper reports BFCL	The K2.5 paper / blog treats BFCL as a main-line eval; #models/06
Qwen3 series (Alibaba)	Yes: self-reported V3 + V4	The Qwen team blog lists BFCL in its release notes; each Qwen3.5 size reports separately on BFCL V4 (llm-stats.com)
GLM-4.5 / 4.7 (Z AI)	Yes	Z AI's official papers consistently cite BFCL; in the §6 CSV, GLM-4.6 is the top open-weight model on V4 (72.38)
INTELLECT-3 (Prime Intellect)	Yes	#models/08: BFCL V3 63.5

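Conclusion: GPT-5.5 is the only Western frontier model that writes BFCL into its system card, and it does so as a source of CoT-Control tasks rather than as an eval target. Anthropic and Google lean toward MCP-Atlas or their own agentic benchmarks. Chinese labs (Qwen / GLM / Kimi) and training-oriented small models (xLAM / Hammer / TOUCAN / EnvScaler / AWM), however, cite BFCL almost 100% of the time. So BFCL is close to a mandatory KPI in the "open-weight + ≤40B function-calling" space and a second choice in the frontier proprietary space.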

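§10 My take + practical advice

10.1 My take

BFCL is a schema-level standard, not a production-agent standard. It measures whether a model can make the right call given a function spec; it does not measure completing a task on a real OAuth-backed Notion. In the capability tree the two are parent and child: BFCL is the prerequisite, and the MCP benchmarks are where production readiness is tested. Any ≤40B model aiming at function calling should first reach ≥70% on BFCL (the V3 Overall gate of §10.3), then move on to MCP-Atlas / Toolathlon.
V4's reweighting is aggressive: Agentic gets 40% outright. V4 cuts the single-turn share (Live + Non-Live) from 66% to 20%, so the strategy that produced high V3 scores (TOUCAN 1.5M SFT) pays off less on V4. A high V4 score needs (a) solid multi-turn state tracking, (b) sequential decisions over search/memory API calls, and (c) the ability to decline or ask follow-up questions. That is exactly the sweet spot for synthetic-environment training in the style of EnvScaler + SETA + AWM.
Current ≤40B ceiling: on V3, Qwen3-32B (75.7%, mirror figure). On V4, per the official CSV in §6, Nanbeige4-3B-Thinking-2511 (FC) leads the ≤40B group at 51.40, GLM-4.6 leads open-weight at 72.38 (it is not a ≤40B model), and Claude-Opus-4-5 (FC) leads overall at 77.47. The GLM-4.5 70.9% figure quoted earlier in this article appears to predate the official-CSV pull. The open-source race over the next year (2026 H2) will most likely be fought on V4 among Qwen3.5 / GLM-4.7-Air / DeepSeek-V4-FC.
Claude doing poorly on BFCL ≠ Claude being bad at tool use. The huge gap between 25.3 on BFCL V3 and 77.3 on MCP-Atlas comes down to a mismatch between Anthropic's Tool Use API schema and BFCL's default prompt-mode schema. In my reading, this is the real reason V4 added the format_sensitivity diagnostic. If you use Claude as a tool agent, use the native FC handler, claude-opus-4-1-20250805 (CHANGELOG #1019).
What BFCL lacks is what the MCP benchmarks are filling in. BFCL has no OAuth, real API rate limits, side effects, long horizons (>10 turns), cross-server orchestration, or ambiguity in natural-language queries. Those are exactly the lanes that MCP-Universe (#25) / MCPMark / Toolathlon / Atlas each carve out. BFCL's blank space is the MCP benchmarks' opportunity.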

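10.2 How to use BFCL, and how not to, when training ≤40B models

✅ Do

SFT cold-start check: after 1 epoch of tool SFT (on TOUCAN / xLAM data), run the three BFCL V3 categories simple_python + multiple + parallel. If all are ≥85%, the schema basics are fine and you can move to the multi-turn stage
Early hallucination screen: irrelevance + live_irrelevance + multi_turn_miss_func are the three categories dedicated to declining; any of them below 50% means the model is making up function calls
Multi-turn state tracking as a gating signal: during RL, use multi_turn_base + multi_turn_miss_param as the dev set and as the basis for advancing the curriculum
Use the V4 agentic subset as a dry run for MCP benchmarks: MCPMark / Atlas are expensive to run, so first run BFCL V4 memory_kv + web_search_base and only move to real MCP once those pass
Always run format sensitivity: V4's format_sensitivity does not count toward the total but has high diagnostic value. Does your model's score vary by ±10% across the 5 system-prompt formats? If so, it is still overfitting to the schema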
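The thresholds in the list above can be collected into one gate function. This is a sketch under stated assumptions: the category names are those used in the list, the input is a plain dict of per-category accuracies on a 0-100 scale, and "±10% variance" is interpreted here as a spread (max minus min) of 10 pp or more across the five prompt variants. That interpretation and all example scores are hypothetical.

```python
SCHEMA_CATEGORIES = ("simple_python", "multiple", "parallel")
DECLINE_CATEGORIES = ("irrelevance", "live_irrelevance", "multi_turn_miss_func")


def cold_start_gates(scores, format_variant_scores):
    """Return the list of failed checks; an empty list means every gate passes."""
    failures = []
    for category in SCHEMA_CATEGORIES:
        if scores[category] < 85.0:
            failures.append(f"schema: {category} below 85")
    for category in DECLINE_CATEGORIES:
        if scores[category] < 50.0:
            failures.append(f"hallucination: {category} below 50")
    # Assumed reading of the format-sensitivity rule: spread across variants.
    spread = max(format_variant_scores) - min(format_variant_scores)
    if spread >= 10.0:
        failures.append(f"format sensitivity: spread {spread:.1f} pp")
    return failures


# Hypothetical scores for a checkpoint after one epoch of tool SFT.
example_scores = {
    "simple_python": 91.0, "multiple": 88.5, "parallel": 84.0,
    "irrelevance": 72.0, "live_irrelevance": 48.0, "multi_turn_miss_func": 55.0,
}
failures = cold_start_gates(example_scores, [61.0, 58.5, 63.0, 49.5, 60.0])
for failure in failures:
    print(failure)
```

Returning every failed check, rather than stopping at the first, makes it easier to see whether a checkpoint has one isolated weakness or several.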

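❌ Don't

Don't treat BFCL V3 Overall as "my model's overall function-calling level". V4 trims the V3 Multi-Turn weight from 33% to 30% and adds 40% Agentic. A model at 70% on V3 may land around 50% on V4
Don't optimize for BFCL for its own sake, and don't train on the data you test on. The BFCL HF dataset is Apache-2.0 and commercially usable, but using it as training data is contamination. Train on TOUCAN / xLAM / Hammer / EnvScaler and keep BFCL as an OOD eval only
Don't ignore the prompt-mode vs FC-mode difference. The BFCL leaderboard lists the two in parallel (the -FC suffix, shown as "(FC)" in the §6 tables, means native function calling). Scores commonly differ by 5-15 pp, and §6.2 has gaps as large as 44 pp (Claude-Opus-4-5). Report whichever mode you deploy
Don't treat web_search category numbers as a rigorous eval. SerpAPI results drift over time, and model scores correlate strongly with the SerpAPI key tier
Don't use BFCL as a substitute for MCP-Atlas / Universe. See the table in §8.2: they do not measure the same thing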

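10.3 One-line recommendation

The standard pipeline for training a ≤40B function-calling model: TOUCAN 1.5M SFT cold start → BFCL V3 Overall ≥70% gate → EnvScaler/SETA synthetic-env RLVR → BFCL V4 Agentic ≥50% gate → real tests on MCP-Atlas / MCPMark. BFCL is the second checkpoint (right after the SFT cold start), not the finish line, but a model that cannot clear it cannot move on.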

§11 BFCL-targeted RL / post-train infra

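This chapter was added in 2026-05. Since BFCL is the de-facto KPI for ≤40B function calling (§9), the question "which frameworks / datasets / recipes were trained with BFCL as the direct optimization target?" becomes immediately actionable. Below is every verifiable artifact in the current stack, with repo URL / license / supporting evidence.

11.1 The Gorilla team's own training code: absent

This is a negative finding. The Berkeley Gorilla team has not released public RL code for "training on BFCL". Specifically:

gorilla/openfunctions/ contains only inference_hosted.py + inference_local.py + openfunctions_utils.py, all inference / serving code. There is no training loop, no dataset construction, and no RL reward.
gorilla/gorilla/ contains only eval/ + inference/, with the same absence of SFT/RL code.
The original Gorilla paper (2023, API Bench) used self-instruct + LLaMA-7B SFT, but the training pipeline was not open-sourced. The OpenFunctions v0/v1/v2 checkpoints are on HuggingFace without training scripts.
The BFCL repo itself contains only berkeley-function-call-leaderboard/bfcl_eval/ (pip install bfcl-eval), which is evaluation only.

Conclusion: to "train against BFCL" you need a third-party framework. The stack that is actually usable today follows.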

11.2 NVIDIA Tool-N1: the only open-source RL training codebase that explicitly targets BFCL

Field	Value
Repo	github.com/NVlabs/Tool-N1
Paper	arXiv:2505.00024 "Nemotron-Research-Tool-N1: Tool-Using LLMs with Reinforced Reasoning"(Zhang et al. 2025)
License	Apache-2.0
Underlying RL framework	verl (ByteDance HybridFlow, GRPO algorithm)
SFT framework	LLaMA-Factory + LoRA
Reward design	R1-style binary reward: structural validity ∧ functional correctness (no CoT annotation needed)
Eval target	BFCL + APIBank + ACEBench (the repo ships `eval/` + BFCL handlers)
Provided artifacts	preprocessed RL+SFT datasets · training scripts · eval scripts · model ckpts

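Core claim (paper verbatim): "RL offers a more effective paradigm for enhancing the tool-calling capabilities of LLMs compared to standard supervised fine-tuning", demonstrated on BFCL, where RL beats SFT. This is the first BFCL-native, end-to-end RL codebase backed by empirical results.

Top recommendation (my top pick from this research): Tool-N1 is currently the only artifact that (1) explicitly uses BFCL as its evaluation target, (2) is fully open source (dataset + scripts + ckpt + Apache-2.0), (3) builds on a mature RL stack (verl + GRPO), and (4) has a published paper validating its effectiveness. It is more complete than xLAM (cc-by-nc-4.0, SFT-leaning) and Hammer (no RL).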
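The "structural validity ∧ functional correctness" reward in the table above can be sketched in a few lines. This is an illustration of the idea, not Tool-N1's code: the `<think>` / `<tool_call>` tag layout, the JSON call format, and the exact-match comparison are all assumptions made for the example.

```python
import json
import re

# Assumed response layout: <think>...</think> followed by <tool_call>[...]</tool_call>.
_RESPONSE = re.compile(
    r"^\s*<think>(?P<think>.+?)</think>\s*<tool_call>(?P<calls>.+?)</tool_call>\s*$",
    re.DOTALL,
)


def _canonical(calls):
    """Order-insensitive canonical form of a list of call dicts."""
    return sorted(json.dumps(call, sort_keys=True) for call in calls)


def binary_reward(response, gold_calls):
    """1.0 only if the format is valid AND the calls match the gold calls exactly."""
    match = _RESPONSE.match(response)
    if match is None:
        return 0.0  # structural validity failed
    try:
        predicted = json.loads(match.group("calls"))
    except json.JSONDecodeError:
        return 0.0  # tool-call payload is not valid JSON
    if not isinstance(predicted, list) or not all(isinstance(c, dict) for c in predicted):
        return 0.0
    return 1.0 if _canonical(predicted) == _canonical(gold_calls) else 0.0


gold = [{"name": "get_weather", "arguments": {"city": "Berkeley"}}]
good = ('<think>The user wants the weather.</think>'
        '<tool_call>[{"arguments": {"city": "Berkeley"}, "name": "get_weather"}]</tool_call>')
no_tags = '[{"name": "get_weather", "arguments": {"city": "Berkeley"}}]'
wrong_arg = ('<think>Weather.</think>'
             '<tool_call>[{"name": "get_weather", "arguments": {"city": "Oakland"}}]</tool_call>')
print(binary_reward(good, gold), binary_reward(no_tags, gold), binary_reward(wrong_arg, gold))
```

Because the reward is all-or-nothing, it needs no reasoning annotations: the content of the `<think>` span is never scored, only its presence.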

11.3 Salesforce xLAM + APIGen + ActionStudio: SFT-heavy, largest data scale

Field	Value
Repo	github.com/SalesforceAIResearch/xLAM (619 stars, Apache-2.0 license)
Key subdirectories	`actionstudio/` (EMNLP 2025 main paper, training) + `xLAM/` (inference)
Dataset 1	xlam-function-calling-60k: synthesized with APIGen, HF top-3 trending 2024-07
Dataset 2	APIGen-MT-5k: multi-turn function calling, released 2025-05
Paper	arXiv:2504.03601 (APIGen-MT, ICLR 2026 accepted)
Ckpt licenses	cc-by-nc-4.0 (research only): the key restriction
BFCL on V4 (official CSV)	xLAM-2-32b-fc-r 54.66 · 70b 53.07 · 8b 46.68 · 3b 41.22 · 1b 30.44
RL or SFT	Mostly SFT (actionstudio is a unified trajectory training pipeline, not an RL framework)

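The best path for training toward BFCL with xLAM has three stages: (a) cold-start SFT on xlam-function-calling-60k, (b) multi-turn SFT on APIGen-MT-5k, (c) GRPO RL with Tool-N1 (§11.2). The SFT stages carry the strongest current evidence for the BFCL multi-turn category (Llama-xLAM-2-70b-fc-r reaches 77.38% on multi-turn alone, §6.5); the RL stage (c) is a proposed extension rather than part of those results.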

11.4 MadeAgents Hammer: function-masking SFT, on-device orientation

Field	Value
Repo	github.com/MadeAgents/Hammer (Apache-2.0)
Paper	arXiv:2410.04587 "Hammer: Robust Function-Calling via Function Masking"(ICLR 2025 Spotlight)
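Training subdirectory	`train/` contains only `data_processing.py`; the full trainer is not public, so you have to wire up the HF trainer yourself
Ckpt license	cc-by-nc-4.0 (0.5B/1.5B/7B) · qwen-research (3B)
BFCL on V4	7B: 31.67 · 3B: 29.71 · 1.5B: 27.88 · 0.5B: 21.22 (WebSrch = 0.00 at every size; Memory between 0.00 and 3.01)
Positioning	On-device function calling; the model family hit hardest by V4's agentic weight (WebSrch/Memory almost all 0)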

11.5 verl / SkyRL / NeMo-RL: general RL frameworks (can host a BFCL reward)

Framework	Repo	License	Direct BFCL support?	Notes
verl	verl-project/verl	Apache-2.0	No built-in BFCL recipe; Tool-N1 uses it as its backbone	ByteDance HybridFlow; supports GRPO/PPO/DPO. Added zero-mismatch HF rollout in 2026-05. The BFCL reward has to be written by hand, about 100 lines (using the AST checker from `bfcl_eval` as the reward fn is enough).
SkyRL	NovaSky-AI/SkyRL	Apache-2.0	No BFCL recipe	UCB NovaSky, modular full-stack. The SkyRL-Agent sub-project (arXiv:2511.16108) focuses on long-horizon multi-turn, so adapting BFCL multi_turn should be feasible in principle.
NVIDIA NeMo-RL	NVIDIA-NeMo/RL	Apache-2.0	No BFCL recipe	Part of NeMo, focused on DeepScaleR-style verifiable rewards. NeMo-Gym can serve as the env container. The NVlabs release of Tool-N1 uses verl rather than NeMo-RL, an interesting signal about NVIDIA's internal organization.
verl-tool	TIGER-AI-Lab/verl-tool	Apache-2.0	Generic tool support, no BFCL-specific recipe	Tool-use fork of verl, 981 stars. Wraps tool calls as an RL env, so BFCL-style tasks can be plugged in directly.
OpenPipe ART	OpenPipe/ART	Apache-2.0 (9.4k stars)	No BFCL example	The closest thing is `examples/mcp-rl/` (MCP server tool use) + LangGraph integration. Full GRPO support; a BFCL adaptation is not hard, but there is no ready-to-run notebook.
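To show what an AST-based reward function of the kind mentioned in the verl row might look like, here is a stdlib-only stand-in. It does not use or reproduce the `bfcl_eval` checker's API; the function names, the keyword-arguments-only restriction, and the "list of allowed values per parameter" ground-truth shape are assumptions made for this sketch.

```python
import ast


def parse_call(text):
    """Parse 'func(a=1, b="x")' into (name, kwargs); return None if malformed."""
    try:
        node = ast.parse(text.strip(), mode="eval").body
    except SyntaxError:
        return None
    if not isinstance(node, ast.Call) or node.args:
        return None  # this sketch accepts keyword arguments only
    try:
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except (ValueError, TypeError):
        return None  # an argument is not a plain literal
    if None in kwargs:
        return None  # **kwargs unpacking is not supported here
    return ast.unparse(node.func), kwargs


def ast_reward(model_output, expected_name, allowed_values):
    """1.0 if name and every argument match; allowed_values maps param -> accepted values."""
    parsed = parse_call(model_output)
    if parsed is None:
        return 0.0
    name, kwargs = parsed
    if name != expected_name or set(kwargs) != set(allowed_values):
        return 0.0
    matches = all(kwargs[param] in allowed_values[param] for param in allowed_values)
    return 1.0 if matches else 0.0


truth = {"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius"]}
r_ok = ast_reward('get_weather(city="Berkeley, CA", unit="celsius")', "get_weather", truth)
r_bad_value = ast_reward('get_weather(city="Oakland", unit="celsius")', "get_weather", truth)
r_malformed = ast_reward('get_weather(city="Berkeley"', "get_weather", truth)
print(r_ok, r_bad_value, r_malformed)
```

Parsing with `ast` instead of comparing strings makes the reward insensitive to argument order and whitespace. When wiring this into an RL loop, the real checker should replace `ast_reward`, since it handles cases this sketch ignores (optional parameters, type coercion, multiple calls).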

11.6 BFCL-targeted RL papers(2025-2026)

Paper	Core idea	Uses BFCL?	Open source?
Tool-N1(arXiv:2505.00024, NVIDIA)	R1-style binary reward + GRPO + verl, replacing SFT	Primary eval target	✅ NVlabs/Tool-N1 Apache-2.0
"Reasoning through Exploration: RL for Robust Function Calling"(arXiv:2508.05118)	EGPO: entropy-guided GRPO, with CoT entropy entering the advantage	BFCL eval	Repo not stated
RC-GRPO(arXiv:2602.03025)	Reward-Conditioned GRPO, reward token-conditioned exploration	multi-turn tool calling (incl. BFCL)	?
R2IF(arXiv:2604.20316)	composite reward (format + CER + SMV) + GRPO, emphasizing interpretable function calling	BFCL eval	?
TOUCAN(arXiv:2510.01179)	1.5M synthetic MCP trajectories for SFT (not RL)	BFCL V3 eval	✅ CC-BY 4.0 dataset
APIGen-MT(arXiv:2504.03601)	Two-agent simulation that generates multi-turn data; training source for xLAM-2-fc-r	BFCL + τ-bench eval	✅ APIGen-MT-5k dataset, ckpt cc-by-nc
GPT-5.5 system card "CoT-Control"	13K-task CoT-controllability suite, with BFCL as one of the source benchmarks	Task source	Eval set: YuehHanChen/CoTControl

GPT-5.5 system card verbatim: "CoT-Control includes over 13,000 tasks built from established benchmarks: GPQA, MMLU-Pro, HLE, BFCL and SWE-Bench Verified. Each task is created by pairing a benchmark problem with one CoT instruction such as avoiding certain problem-relevant keywords in CoT, using only lowercase letters, or appending a given word to each sentence." This moves BFCL from an eval that gets reported to a source of task problems, the most substantive way a frontier lab has cited BFCL so far. Note that the quoted passage does not itself say whether the tasks are used for training or only for evaluation.

11.7 Comparison matrix: Stage × Tool × License × BFCL-specific?

Tool / Stack	SFT	RL	Distill	License	BFCL-targeted?	Recommended use
NVlabs/Tool-N1	✅ LLaMA-Factory	✅ verl+GRPO	-	Apache-2.0	✅ explicit	BFCL-native RL, first choice
SalesforceAIResearch/xLAM	✅ ActionStudio	-	-	code Apache-2.0 / ckpt cc-by-nc	✅ indirect (self-reported top BFCL scores)	multi-turn SFT cold start
MadeAgents/Hammer	✅ (partial)	-	-	Apache-2.0 (code)	✅ indirect	on-device tiny model
verl-project/verl	-	✅	-	Apache-2.0	❌ (general-purpose)	BFCL RL backbone
NovaSky-AI/SkyRL	-	✅	-	Apache-2.0	❌ (general-purpose)	multi-turn long-horizon
OpenPipe/ART	✅	✅ GRPO	✅	Apache-2.0	❌ (MCP-RL is close)	fast iteration
TIGER-AI-Lab/verl-tool	-	✅	-	Apache-2.0	❌	tool env wrap
NVIDIA-NeMo/RL	-	✅	-	Apache-2.0	❌	large-scale clusters
TOUCAN dataset	✅ (1.5M traj)	-	-	CC-BY 4.0	✅ eval	SFT cold-start data
APIGen-MT dataset	✅ (5k MT)	-	-	cc-by-nc-4.0	✅ eval	multi-turn top-up

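11.8 My take: a practical BFCL training stack

Start with pure SFT: Qwen3-8B (Apache-2.0 base) + TOUCAN-1.5M (cold start, 4 epochs) + APIGen-MT-5k (multi-turn, 2 epochs). Targets: V3 ≥70% → V4 NonLive ≥85% / Live ≥75%. The whole thing fits in about a week on 4×A100 with HF trainer + LoRA.
Add RL: plug in Tool-N1's verl + GRPO, with the reward built from the AST checker under BFCL's bfcl_eval/eval_checker/ast_eval/ (import it directly; this is the most important practical point in §11) plus a multi_turn state diff. Keep the train split and the eval split strictly separate: BFCL has no official train/val split, so you have to build your own hold-out.
Pitfalls:
- Don't train on BFCL's own ground truth: it is the eval set, so that is contamination
- Don't use xLAM checkpoints commercially (cc-by-nc); for commercial use you must retrain from a base model + dataset
- Don't chase BFCL Overall alone: break out multi_turn vs irrelevance, which often move in opposite directions (xLAM is the typical case)
Next step (BFCL as a prerequisite): once BFCL V4 Agentic ≥50%, move on to MCP-Atlas / MCPMark (#21); BFCL cannot replace a real MCP eval (§8.2).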
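For the hold-out that the RL step above says you have to build yourself, one common approach is a deterministic split keyed on a hash of each example's ID, so the assignment survives reshuffling and dataset growth. The sketch below is one way to do it; the salt string, the 10% fraction, and the ID format are arbitrary illustrative choices.

```python
import hashlib


def split_of(example_id, holdout_fraction=0.1, salt="bfcl-holdout-v1"):
    """Deterministically assign an example to 'holdout' or 'train' by hashing its ID."""
    digest = hashlib.sha256(f"{salt}:{example_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "holdout" if bucket < holdout_fraction else "train"


# Hypothetical IDs standing in for the IDs of your own training examples.
ids = [f"example_{i}" for i in range(1000)]
assignment = {example_id: split_of(example_id) for example_id in ids}
holdout_count = sum(1 for split in assignment.values() if split == "holdout")
print(holdout_count, len(ids) - holdout_count)
```

Because the split depends only on the ID and the salt, an example never migrates between train and hold-out when new data is added. Changing the salt produces a fresh, independent split.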