BFCL (Berkeley Function-Calling Leaderboard): the de facto standard for function calling

UC Berkeley · Gorilla LLM Team (Sky Computing Lab) · Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, Joseph E. Gonzalez · ICML 2025 (paper) · V1: 2024-02-26 → V2: 2024-08-19 → V3: 2024-09-19 → V4: 2025-07-17
Live leaderboard · GitHub: ShishirPatil/gorilla · HF Dataset · ICML 2025 paper · PyPI: bfcl-eval
Keywords: function-calling · tool use · AST eval · multi-turn · agentic · Web Search · Memory · Format Sensitivity · Apache-2.0

Quick-read card (TL;DR)

In one sentence: BFCL was launched in February 2024 by the Gorilla LLM team at UC Berkeley's Sky Computing Lab (Patil, Mao, Stoica, Gonzalez). The project describes itself as "the first comprehensive and executable function call evaluation", and it is now the de-facto standard for function calling: frontier model cards from Microsoft, Salesforce, Anthropic, Google, and OpenAI report it, and nearly every small or mid-sized tool-calling model (Qwen3, GLM, Kimi, DeepSeek, xLAM, Hammer, Granite, Functionary, and others) treats BFCL as a training target.

V4 (2025-07-17)
Current version: adds agentic Web Search, Memory, and Format Sensitivity
~4,951
Number of scoring tasks (2,251 Live + 1,610 Non-Live + 1,000 Multi-Turn + ~90 agentic)
GLM-4.5 70.9%
V4 SOTA in the earlier snapshot (an open-weight model overtaking GPT-5 at 59.2%); see §6 for the 2026-05 standings
Apache-2.0
Repo, dataset, and evaluation scripts are all open source · 49,556 downloads/month

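Position: after the #21 MCP benchmark comparison and the #26 code-level deep dive, it is clear that the MCP benchmarks (MCP-Universe / Atlas / Toolathlon / MCPMark) are stateful multi-system benchmarks, whereas BFCL is a schema-and-trajectory benchmark. The two measure different dimensions. A high BFCL score (GLM-4.5 70.9%; Qwen3-32B 75.7% on V3) does not imply a high MCP-Atlas or MCP-Universe score (GPT-5 reaches 43.72% on MCP-Universe; Claude Opus 4.7 reaches 77.3% on MCP-Atlas). BFCL is still the mandatory first stop when training function-calling models of ≤40B parameters. The ladder runs from schema compliance, to robustness on live data, to multi-turn state tracking, to agentic orchestration, and BFCL is its first rung. It is also the only schema-level benchmark with fully automatic AST verification that needs no LLM judge.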


§1 What BFCL is, and why it is the de facto function-calling standard

How the BFCL team positions the project (README, verbatim):

"We introduce the Berkeley Function Calling Leaderboard (BFCL), the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions. Unlike previous evaluations, BFCL accounts for various forms of function calls, diverse scenarios, and executability."
repo README

The abstract of the ICML 2025 paper ("The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models") is more direct in its final sentence:

"Since its preview, BFCL has become the defacto standard for evaluating function-calls, and can be accessed at gorilla.cs.berkeley.edu/leaderboard.html."
— Patil et al., ICML 2025, abstract verbatim

1.1 What "de facto standard" means concretely

The claim that BFCL is the "function-calling de facto standard" can be checked at several levels:

1.2 Team and academic lineage

Group: UC Berkeley Sky Computing Lab (formerly RISE Lab) · Gorilla LLM team
Leads: Shishir G. Patil (PhD, Berkeley), Huanzhi Mao (Berkeley)
Senior faculty: Ion Stoica (Spark / Anyscale / vLLM) and Joseph E. Gonzalez (BAIR). These two names all but guarantee that BFCL will not end up as a toy; the BFCL tests bundled with vLLM come from the same group.
Sister projects: Gorilla: Large Language Model Connected with Massive APIs (2023-05, arXiv:2305.15334) → OpenFunctions (BFCL V1 first shipped in 2024-02; the core evaluator in the repository is still named openfunctions_evaluation.py)
Email: huanzhimao@berkeley.edu (the official contact given at the end of the README)
Discord channel: discord.gg/grXXvj9Whz#leaderboard

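§2 History: the V1 → V2 → V3 → V4 evolution

The four generations of BFCL follow a clear trajectory: each one addresses a weakness exposed by its predecessor (evaluation too easy, then overfitting, then gaps in what should have been tested). The timeline below lists what each generation added.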

[Figure: BFCL timeline across four generations]
V1 (2024-02-26): AST eval · Simple · Multiple · Parallel · ParaMulti · Java · JS · REST · Relevance Detection · ~1,700 tasks
V2 (2024-08-19): + Live data · enterprise + OSS contributed · real-world prompts · irrelevance: 882 · +2,251 tasks
V3 (2024-09-19): + Multi-Turn · base 200 · miss-param 200 · miss-func 200 · long-ctx 200 · +1,000 multi-turn
V4 (current, 2025-07-17): + Agentic · Web Search (SerpAPI) · Memory (KV / Vec / RecSum) · Format Sensitivity · reweighting: Agentic 40% · +~90 agentic tasks
Progression: single-turn schema compliance → real-world distribution (Live) → multi-turn state tracking (Multi-Turn) → full agentic loop (Web/Memory). Each step moves closer to a "real production tool agent" while staying schema-grounded and automatically evaluated.
V4 overall-acc reweighting: Non-Live 33%→10% · Live 33%→10% · Multi-Turn 33%→30% · Agentic 0%→40% · Irrelevance 0%→10%
Caption: timeline of BFCL's four generations. Dates and figures are taken verbatim from CHANGELOG.md and the HF dataset card. The V1 date comes from the HF dataset card: "Original Release: 02/26/2024".

2.1 Per-version notes (verbatim)

V1 — Simple/Parallel/Multiple Function Call eval with AST (2024-02-26)

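The HF dataset card gives "Original Release: 02/26/2024". V1 introduced the core innovation: matching parameters and types on the AST (abstract syntax tree), rather than comparing the model output as a string or actually executing it. Original categories: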

V2 — Enterprise and OSS-contributed Live Data (2024-08-19)

CHANGELOG verbatim:

"[August 19, 2024] #580: Introduce BFCL V2 Live dataset, featuring user-contributed live prompts and function docs. To read more about the composition and construction of this dataset, please refer to our blog. All CLI commands have been updated to support the new dataset."

V2 Live distribution (verbatim from the HF dataset card):

V3 — Multi-Turn & Multi-Step Function Call Evaluation (2024-09-19)

CHANGELOG verbatim:

"[Sept 19, 2024] #644: BFCL V3 release: Introduce new multi-turn dataset and state-based evaluation metric."

V3's 1,000 multi-turn entries (official blog):

V4 — Agentic Web Search + Memory + Format Sensitivity (2025-07-17)

This is the current version. CHANGELOG #1019 lists the V4 changes verbatim:

"[Jul 17, 2025] #1019: BFCL V4 release:
  1. New agentic domain — Introduces the agentic domain with two categories: Web Search and Memory Management.
  2. Revised overall-accuracy formula — As single-turn tasks approach saturation, weighting now favors complex, multi-step agentic tasks."

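The new weight table (the most important change in V4):

Segment | Old % | New %
Live (V2, user-contributed) | 33 | 10
Non-Live (V1, academic synthetic) | 33 | 10
Irrelevance Detection | 0 | 10
Multi-Turn (V3) | 33 | 30
Agentic (new in V4) | 0 | 40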
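The reweighting can be made concrete with a short calculation. The sketch below is an illustration, not BFCL's scoring code: the function name v4_overall and the dictionary layout are hypothetical, and it assumes that the Agentic segment is the unweighted mean of the Web Search and Memory scores. Under that assumption it reproduces, to rounding, the Overall values that the §6 tables report for Claude-Opus-4-5 (FC) and GLM-4.6.

```python
# Segment weights in percent, taken from the "New %" column of the weight table.
V4_WEIGHTS = {
    "non_live": 10,
    "live": 10,
    "irrelevance": 10,
    "multi_turn": 30,
    "agentic": 40,
}


def v4_overall(non_live, live, irrelevance, multi_turn, web_search, memory):
    """Weighted overall accuracy (all inputs and the result are percentages)."""
    # Assumption: agentic = simple mean of the Web Search and Memory columns.
    agentic = (web_search + memory) / 2
    scores = {
        "non_live": non_live,
        "live": live,
        "irrelevance": irrelevance,
        "multi_turn": multi_turn,
        "agentic": agentic,
    }
    return sum(V4_WEIGHTS[name] * scores[name] for name in V4_WEIGHTS) / 100


# Inputs copied from the Claude-Opus-4-5 (FC) row in §6.2.
claude_fc = v4_overall(
    non_live=88.58, live=79.79, irrelevance=84.72,
    multi_turn=68.38, web_search=84.50, memory=73.76,
)
print(f"{claude_fc:.3f}")
```

Because Agentic carries 40% of the weight, a model with near-saturated single-turn scores but weak Web Search or Memory results lands far down the table, which is the pattern visible in the lower rows of §6.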

Beyond that, V4 also:

Details of the three V4 parts

V4 Part 1, Agentic Web Search (blog 15): released 2025-07-17. It introduces multihop QA tasks: "100 human-crafted multihop questions spanning various domains". Running it requires SerpAPI (the README states: "For the web_search test category, we use the SerpAPI service to perform web search. You need to sign up for an API key…"). The sub-categories are web_search_base and web_search_no_snippet; the latter forces the model to fetch and read the webpage instead of relying on the search snippet.

V4 Part 2, Agentic Memory Management (blog 16): released the same day, with 150+ questions. It covers five domains: College Student Advising / Customer Support / Personal To-Do List / Healthcare Patient / Finance Managing Director. There are three memory backends (inspired by MemGPT / Mem0):

V4 Part 3, Agentic Format Sensitivity (blog 17): this part does not count toward the overall score and serves only as a diagnostic. It checks whether small changes to the prompt template cause prompting-mode models to collapse.


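§3 Task composition in detail

BFCL V4 has 25 individual test categories (extracted verbatim from TEST_CATEGORIES.md). Grouped by general_category:

group | category (repo ID) | description | source
non_live (V1, academic synthetic) | simple_python | Simple single-function Python call | V1
non_live | simple_java | Single-function Java call | V1
non_live | simple_javascript | Single-function JavaScript call | V1
non_live | multiple | Pick one function to call from several candidates | V1
non_live | parallel | Same function called several times in parallel | V1
non_live | parallel_multiple | Several functions called several times in parallel | V1
irrelevance | irrelevance | All function docs are irrelevant → the model must refuse to call | V1
live (V2, user-contributed) | live_simple | User-contributed real-world simple call | V2
live | live_multiple | User-contributed, multiple functions | V2
live | live_parallel | User-contributed, parallel | V2
live | live_parallel_multiple | User-contributed, multiple functions in parallel | V2
live | live_irrelevance | User-contributed "should refuse" scenarios | V2
live | live_relevance | User-contributed "should call" scenarios | V2
multi_turn (V3, stateful) | multi_turn_base | 200 entries: basic multi-turn with all information available | V3
multi_turn | multi_turn_miss_func | 200 entries: missing function → should refuse | V3
multi_turn | multi_turn_miss_param | 200 entries: missing parameter → should ask a follow-up question | V3
multi_turn | multi_turn_long_context | 200 entries: long-context stress | V3
agentic (V4) | memory_kv | read/write/search against a KV-store backend | V4
agentic | memory_vector | Vector-database backend | V4
agentic | memory_rec_sum | Recursive-summarization backend | V4
agentic | web_search_base | SerpAPI multihop QA | V4
agentic | web_search_no_snippet | Forces fetch + read of the webpage | V4
diag (not scored) | format_sensitivity | System-prompt format perturbation diagnostic | V4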

The V4 augmented subsets also include composite (a combination of the three multi-turn augmentations), although blog 13 classifies it under augmented multi-turn.

Category descriptions, verbatim (README excerpt)
- simple_python: Simple Python function calls. Part of the "non-live simple"
  category on the leaderboard.
- multiple: Multiple function calls in sequence.
- parallel: Multiple function calls in parallel.
- parallel_multiple: Multiple function calls in parallel and in sequence.
- irrelevance: Function calls with irrelevant function documentation.
- multi_turn_base: Base entries for multi-turn function calls.
- multi_turn_miss_func: Multi-turn function calls with missing function.
- multi_turn_miss_param: Multi-turn function calls with missing parameter.
- multi_turn_long_context: Multi-turn function calls with long context.
- memory_kv: Tests reading from and writing to a key-value memory backend.
- memory_vector: Tests reading from and writing to a vector-database memory backend.
- memory_rec_sum: Tests reading from and writing to a recursive-summarization memory backend.
- web_search_base: Base entries for web-search calls.
- web_search_no_snippet: Web-search calls where search-engine snippets are withheld,
  forcing the model to fetch and read webpages.
- format_sensitivity: Various system prompt formats to test the format sensitivity
  of the model. (Only works for prompting mode models.)

3.1 A few prompt template snippets

Simple Python (V1): ground truth for a typical simple_python entry (adapted from simple_python_x in the repo):

{
  "id": "simple_python_3",
  "question": [[{
    "role":"user",
    "content":"Calculate the area of a triangle with base 10 and height 5."
  }]],
  "function": [{
    "name": "calculate_triangle_area",
    "description": "Calculate the area of a triangle given its base and height.",
    "parameters": {"type":"dict",
      "properties":{
        "base":{"type":"integer","description":"Base length in units"},
        "height":{"type":"integer","description":"Height in units"},
        "unit":{"type":"string","description":"Unit (cm, m, etc.)","default":"units"}
      },
      "required":["base","height"]}
  }],
  "possible_answer": [{
    "calculate_triangle_area":{
      "base":[10], "height":[5], "unit":["units","cm","m"]
    }
  }]
}

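Note that possible_answer lists several allowed values for unit. The BFCL AST checker accepts multiple ground truths (set membership), which is the core difference from a string-match eval.

Multi-Turn (V3): a multi_turn_base entry looks like this (adapted for illustration):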
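A minimal sketch of the set-membership idea follows. It is illustrative only: the function name matches_possible_answer is hypothetical, type checking is left out, and it assumes (the source does not say this) that an optional parameter may simply be omitted from the call.

```python
def matches_possible_answer(call_args, possible_args, required):
    """Return True if every supplied value is one of the accepted values."""
    # Every required parameter has to be present in the model's call.
    if any(name not in call_args for name in required):
        return False
    for name, value in call_args.items():
        # A parameter the ground truth does not know about is rejected.
        if name not in possible_args:
            return False
        # Set membership: any value listed in possible_answer is accepted.
        if value not in possible_args[name]:
            return False
    return True


# Accepted values copied from the simple_python example.
possible = {"base": [10], "height": [5], "unit": ["units", "cm", "m"]}
required = ["base", "height"]

print(matches_possible_answer({"base": 10, "height": 5, "unit": "cm"}, possible, required))
print(matches_possible_answer({"base": 10, "height": 6}, possible, required))
```

A string-match eval would need one exact reference string per accepted combination; listing accepted values per parameter keeps the ground truth compact.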

{
  "id":"multi_turn_base_15",
  "initial_config":{
    "TravelAPI":{...}, "GorillaFileSystem":{...}
  },
  "involved_classes":["TravelAPI","MessageAPI"],
  "question":[
    [{"role":"user","content":"Book me a flight from SFO to JFK on May 20."}],
    [{"role":"user","content":"Wait, change it to May 22 and message my partner."}]
  ],
  "ground_truth":[
    ["TravelAPI.book_flight(origin='SFO', dest='JFK', date='2026-05-20')"],
    ["TravelAPI.modify_booking(booking_id='...', new_date='2026-05-22')",
     "MessageAPI.send(recipient='partner', body='Flight changed to May 22')"]
  ]
}

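3.2 Examples of each V4 task type, extracted verbatim from the repo

About this section: every sample below is the first entry (or the first few entries) extracted directly from the JSON files under raw.githubusercontent.com/ShishirPatil/gorilla/main/berkeley-function-call-leaderboard/bfcl_eval/data/, fetched in 2026-05. The accompanying possible_answer/*.json files give the ground truth, multi_turn_func_doc/*.json the tool signatures, and eval_checker/multi_turn_eval/func_source_code/*.py the backend implementations. After reading this section you should be able to state, for each task type, what the model sees, what it must output, and how it is scored.

3.2.1 non-live · simple_python: single function + multiple ground truths

Already shown in §3.1 above; here is a verbatim parallel entry (the same function called several times in parallel):

// BFCL_v4_parallel.json (entry 1)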
{"id": "parallel_0",
 "question": [[{"role": "user",
   "content": "Play songs from the artists Taylor Swift and Maroon 5,
                with a play time of 20 minutes and 15 minutes respectively, on Spotify."}]],
 "function": [{"name": "spotify.play",
   "description": "Play specific tracks from a given artist for a specific time duration.",
   "parameters": {"type": "dict",
     "properties": {
       "artist": {"type": "string", "description": "..."},
       "duration": {"type": "integer", "description": "..."}},
     "required": ["artist", "duration"]}}]}

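Difficulty: the model must emit two spotify.play calls rather than one call with list arguments. The AST check strictly distinguishes a parallel call from a list argument.

3.2.2 non-live · irrelevance: the model must not call a tool

// BFCL_v4_irrelevance.json (entry 1)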
{"id": "irrelevance_0",
 "question": [[{"role": "user",
   "content": "Calculate the area of a triangle given the base is 10 meters and height is 5 meters."}]],
 "function": [{"name": "determine_body_mass_index",
   "description": "Calculate body mass index given weight and height.",
   "parameters": {"properties": {
     "weight": {"type": "float"}, "height": {"type": "float"}},
     "required": ["weight", "height"]}}]}

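Difficulty: the tool definition contains lure words such as height. For BMI, height is body height; for a triangle, it is the geometric height. A model that takes the keyword bait scores 0. The Irrelevance column on the leaderboard is exactly this refusal accuracy.

3.2.3 multi_turn · multi_turn_base: four turns of file operations

// BFCL_v4_multi_turn_base.json (entry 1)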
{"id": "multi_turn_base_0",
 "question": [
   [{"role":"user","content":"Move 'final_report.pdf' within document directory
                              to 'temp' directory in document. Make sure to create the directory"}],
   [{"role":"user","content":"Perform a detailed search using grep to identify sections
                              in the file pertaining to 'budget analysis'."}],
   [{"role":"user","content":"Upon identifying the requisite 'budget analysis' content,
                              sort the 'final_report.pdf' by line ..."}],
   [{"role":"user","content":"Move 'previous_report.pdf' in document directory to temp ...
                              proceed to juxtapose it with 'previous_report.pdf' ..."}]],
 "initial_config": {
   "GorillaFileSystem": {"root":{"workspace":{"type":"directory",
     "contents":{"document":{"type":"directory",
       "contents":{"final_report.pdf":{"type":"file",
         "content":"Year2024 This is the final report content including budget analysis ..."},
         "previous_report.pdf":{...}}},
       "archive":{"type":"directory","contents":{}}}}}},
   "TwitterAPI": {"tweet_counter":3, "tweets":{...},
     "username":"analyst_pro", "password":"Kj8#mP9$vL2"}},
 "path": ["GorillaFileSystem.find","GorillaFileSystem.mv","GorillaFileSystem.grep",
          "GorillaFileSystem.sort","GorillaFileSystem.diff","TwitterAPI.post_tweet"],
 "involved_classes": ["TwitterAPI","GorillaFileSystem"],
 "excluded_function": ["cp"]}

Key points:

3.2.4 multi_turn · multi_turn_miss_func: an empty turn means a tool was deliberately removed

// BFCL_v4_multi_turn_miss_func.json (entry 1)
{"id": "multi_turn_miss_func_0",
 "question": [
   [{"role":"user","content":"Move 'final_report.pdf' ..."}],
   [{"role":"user","content":"Perform a detailed search using grep ..."}],
   [{"role":"user","content":"... sort the 'final_report.pdf' by line ..."}],
   [],   // ← empty turn: the sort function used in the previous turn was removed via the "missed_function" table
   [{"role":"user","content":"Move 'previous_report.pdf' ... juxtapose ..."}]],
 ...,
 "missed_function": {"3": ["sort"]},  // ← sort is not visible in turn 3
 "excluded_function": ["cp"]}

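Difficulty: in turn 3 (0-indexed), sort is missing from the list of tools available to the model, so the model should answer along the lines of "sorry, I do not have a sort tool" instead of hallucinating a call. An empty [] turn in the question array means that the turn carries no user input and the system lets the model proceed on its own.

3.2.5 multi_turn · multi_turn_miss_param: the model must ask a follow-up question

// BFCL_v4_multi_turn_miss_param.json (entry 1; note the second-to-last turn)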
{"id": "multi_turn_miss_param_0",
 "question": [
   ...,
   [{"role":"user","content":"Move one of the file in document directory to temp ...
                              proceed to juxtapose it with 'previous_report.pdf' ..."}],
   //  ↑ the user says "one of the file" without specifying which one
   [{"role":"user","content":"The specific file is previous_report.pdf."}]],
   //  ↑ the model should have asked; the user then supplies the answer in the next turn
 ...}

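Difficulty: if the model guesses a filename and calls mv directly, it fails. The correct behavior is asking for clarification (reply to the user with a message and issue no function call). This is the core category V3 added to detect "agreeable hallucination".

3.2.6 agentic · web_search (new in V4): multi-hop QA + SerpAPI

// BFCL_v4_web_search.json (entry 1)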
{"id": "web_search_0",
 "question": [[{"role":"user","content":
   "Some countries are known for producing luxury goods, including the world's most
    expensive tea. In April 2025, who is the richest billionaire (according to Forbes)
    from the country that produces the most expensive tea?"}]],
 "involved_classes": ["WebSearchAPI"]}

// possible_answer/BFCL_v4_web_search.json (entry 1; note that the source chain is 3-hop reasoning)
{"id": "web_search_0",
 "ground_truth": ["Zhang Yiming"],   // ← but the 2025 top entry on Wikipedia is actually Zhong Shanshan
 "source": [
   {"subquestion":"Most expensive tea in the world",
    "answer":"DaHong Pao", "source":"https://camellios.com/..."},
   {"subquestion":"Country that produces DaHong Pao",
    "answer":"China", "source":"https://en.wikipedia.org/wiki/Da_Hong_Pao"},
   {"subquestion":"Who is the richest billionaire in China",
    "answer":"Zhong Shanshan", "source":"https://www.forbes.com/..."}],
 "num_hops": 3}

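Exposed tool signatures (multi_turn_func_doc/web_search.json):

Difficulties and data observations:

3.2.7 agentic · memory (the largest V4 addition): full breakdown

This section is the focus of this update, written in response to a specific request to explain what the memory tasks actually are.

The two-stage architecture of the memory tasks (key!):

  1. Stage 1, the prereq stage (not scored; pre-run by the evaluator): for each scenario, memory_prereq_conversation/memory_<domain>.json holds a user monologue of 6-30 turns (five domains: customer / student / patient / finance / note-taking). The evaluator feeds this monologue to the model and lets the model store information through the memory API on its own. The resulting memory dump is persisted to snapshot_folder/<test_id>.json.
  2. Stage 2, the scored stage (this is what the leaderboard reports): the memory persisted in Stage 1 is reloaded, and the model receives a single-turn question (the question field in BFCL_v4_memory.json) such as "What is my first name?" / "How old am I?" / "What kind of latte do I occasionally like?". The model must look the answer up with memory_search / memory_retrieve and give it in a final natural-language reply. The ground truth is a set of strings; for example, ["35","thirty five"] are both accepted.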
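The Stage 2 check against a set of accepted strings can be sketched as follows. This is an assumption-laden illustration: the source does not show BFCL's matching rule, so the normalization and whole-token containment used here, and the names normalize and answer_matches, are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and turn every run of non-alphanumerics into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def answer_matches(response, ground_truth):
    """True if any accepted answer appears in the reply as whole tokens."""
    padded_reply = f" {normalize(response)} "
    # Padding with spaces prevents "35" from matching inside "135".
    return any(f" {normalize(answer)} " in padded_reply for answer in ground_truth)


# Accepted answers copied from memory_1-customer-1.
accepted = ["35", "thirty five"]
print(answer_matches("You told me you are 35 years old.", accepted))
print(answer_matches("You're thirty-five.", accepted))
print(answer_matches("I could not find your age in memory.", accepted))
```

Whatever the exact rule, the score depends on Stage 1 as much as on Stage 2: a fact that was never written to memory cannot appear in the final reply.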

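Five domains (verbatim from the data folder):

domain key | scenario | prereq monologue topic
customer | Customer Support | Customer Michael complains about a damaged espresso machine, accessories, shipping, etc. (30 questions)
healthcare | Healthcare Patient | Medical history / medication / emergency-room visits (30 questions)
finance | Finance Managing Director | A finance lead's portfolio operations / client records
notetaker | Personal To-Do List | Daily tasks / plans
student | College Student Advising | Course selection / GPA / communication with advisors

Verbatim samples: Stage 2 questions (first five entries of BFCL_v4_memory.json):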

{"id":"memory_0-customer-0","question":[[{"role":"user","content":"What is my first name?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}
{"id":"memory_1-customer-1","question":[[{"role":"user","content":"How old am I?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}
{"id":"memory_2-customer-2","question":[[{"role":"user","content":"Where do I live?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}
{"id":"memory_3-customer-3","question":[[{"role":"user","content":"What kind of latte do I occasionally like?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}
{"id":"memory_4-customer-4","question":[[{"role":"user","content":"How many square feet is my kitchen counter?"}]],
 "involved_classes":["MemoryAPI"],"scenario":"customer"}

The matching ground truth and source passages (possible_answer/BFCL_v4_memory.json):

{"id":"memory_0-customer-0","ground_truth":["Michael"],
 "source":"My name is Michael, and this is my first time interacting with your company..."}
{"id":"memory_1-customer-1","ground_truth":["35","thirty five"],
 "source":"I'm 35 years old, live in Seattle..."}
{"id":"memory_2-customer-2","ground_truth":["Seattle"],
 "source":"I'm 35 years old, live in Seattle..."}
{"id":"memory_3-customer-3","ground_truth":["strawberry matcha"],
 "source":"...I occasionally like a strawberry matcha latte..."}
{"id":"memory_4-customer-4","ground_truth":["38","thirty eight"],
 "source":"...my counter is only 38 square feet."}

The Stage 1 prereq monologue looks like this (excerpt from memory_prereq_conversation/memory_customer.json):

{"id":"memory_prereq_0-customer-0","topic":"First-Time Inquiry About a Product",
 "question": [
   [{"role":"user","content":"Hey there! Thanks for getting back to me ...
                              My name is Michael, and this is my first time
                              interacting with your company in any way..."}],
   [{"role":"user","content":"I'm 35 years old, live in Seattle, and am
                              pretty serious about both my work and my hobbies.
                              I work as a freelance graphic designer..."}],
   ... // a long monologue of 30+ turns
 ]}

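Note: every Stage 1 turn is user → assistant. As it reads, the assistant decides whether to call core_memory_add / archival_memory_add to record key facts such as "Michael / 35 / Seattle / strawberry matcha / 38 sqft". What the model chooses to store in Stage 1 directly determines what it can retrieve in Stage 2.

3.2.7.A memory_kv subclass: key-value backend

Source: class MemoryAPI_kv in eval_checker/multi_turn_eval/func_source_code/memory_kv.py. The design is inspired by MemGPT: a two-tier structure of core memory (short-term, up to 7 entries of ≤300 characters each) and archival memory (long-term, up to 50 entries of ≤2000 characters each).

# memory_kv.py verbatim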
MAX_CORE_MEMORY_SIZE = 7
MAX_CORE_MEMORY_ENTRY_LENGTH = 300
MAX_ARCHIVAL_MEMORY_SIZE = 50
MAX_ARCHIVAL_MEMORY_ENTRY_LENGTH = 2000

class MemoryAPI_kv(MemoryAPI):
    """A class that provides APIs to manage short-term and long-term
       memory data in a key-value format."""

    def core_memory_add(self, key: str, value: str) -> Dict[str, str]:
        """Add a key-value pair to the short-term memory.
           Keys must be snake_case and cannot contain spaces."""
        # checks capacity / length / key format / duplicates
        ...

    def core_memory_remove(self, key: str) -> Dict[str, str]: ...
    def core_memory_replace(self, key: str, value: str) -> ...
    def core_memory_clear(self) -> ...
    # plus a matching set of archival_memory_* methods
    def archival_memory_key_search(self, query: str, k: int = 5):
        """Search for key names ... using BM25+ algorithm."""

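Sixteen tool signatures are exposed to the model (multi_turn_func_doc/memory_kv.json): core_memory_add / remove / replace / clear / retrieve / key_search / list_keys, plus a matching set of archival_memory_* tools. Note that keys must be snake_case; this is a format constraint V4 adds deliberately to force disciplined naming.

Difficulty: core memory holds only 7 entries, so the model has to rank facts by importance (for example, deciding whether Michael's name goes into core or archival memory) or it will overflow. Retrieval is BM25+ bag-of-words, so synonym matching is weak: if the key was written as customer_first_name, a query for "name" hits, while a query for "user" may not.

3.2.7.B memory_vector subclass: vector-store backend

Source: class MemoryAPI_vector in memory_vector.py, which internally uses sentence-transformers all-MiniLM-L6-v2 + FAISS (verbatim below).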
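The capacity and key-format constraints can be illustrated with a toy store. This is a sketch, not the BFCL implementation: the class ToyCoreMemory, its return messages, and the exact snake_case regular expression are assumptions; only the two limits are taken from the constants quoted from memory_kv.py.

```python
import re

MAX_CORE_MEMORY_SIZE = 7            # limit quoted from memory_kv.py
MAX_CORE_MEMORY_ENTRY_LENGTH = 300  # limit quoted from memory_kv.py

# Assumed definition of snake_case: lowercase words joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


class ToyCoreMemory:
    """Toy short-term key-value memory with the same limits."""

    def __init__(self):
        self.store = {}

    def add(self, key, value):
        if not SNAKE_CASE.match(key):
            return {"error": "key must be snake_case"}
        if key in self.store:
            return {"error": "key already exists"}
        if len(self.store) >= MAX_CORE_MEMORY_SIZE:
            return {"error": "core memory is full"}
        if len(value) > MAX_CORE_MEMORY_ENTRY_LENGTH:
            return {"error": "value is too long"}
        self.store[key] = value
        return {"status": "added"}


memory = ToyCoreMemory()
print(memory.add("customer_first_name", "Michael"))
print(memory.add("Customer Age", "35"))
```

With only seven slots, every rejected or wasted write in Stage 1 reduces what Stage 2 can look up, which is why the importance ranking matters.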

# memory_vector.py verbatim
ENCODER = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
ENCODER_DIM = ENCODER.get_sentence_embedding_dimension()

class MemoryAPI_vector(MemoryAPI):
    def __init__(self):
        self.core_memory = VectorStore(max_size=7, max_entry_length=300)
        self.archival_memory = VectorStore(max_size=50, max_entry_length=2000)

    def core_memory_add(self, text: str) -> dict[str, str]:
        """Add a new entry to the core memory.
           Returns: id (int): The ID of the added entry."""
        return self.core_memory.add(text)

    def core_memory_retrieve(self, query: str, top_k: int = 5) -> list[dict]:
        """Retrieve the most similar entries from the core memory."""
        # FAISS L2 nearest-neighbor search

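Difference from kv: no key is needed. core_memory_add(text) takes only the text; the system encodes it automatically and returns a numeric id. Retrieval relies on cosine/L2 similarity rather than keywords.

Difficulty: the model cannot give itself "key hints" and depends entirely on embedding recall. If Stage 1 batches several facts into one entry (e.g. "name=Michael, age=35, city=Seattle"), any related Stage 2 query pulls it back, but the top-k results are noisy. If each entry stores a single fact instead, Stage 2 may fail to recall it within the top 5. This is the classic RAG chunk-size tradeoff.

3.2.7.C memory_rec_sum subclass: recursive-summarization backend

Source: class MemoryAPI_rec_sum in memory_rec_sum.py. This backend is the most unusual: the memory is a single plain string with a capacity of 10,000 characters, and there are only 5 tools.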

# memory_rec_sum.py verbatim
MAX_MEMORY_ENTRY_LENGTH = 10000  # 10k characters

class MemoryAPI_rec_sum(MemoryAPI):
    def __init__(self):
        self.memory = ""   # ← a single string, not a dict or a list

    def memory_append(self, text: str) -> Dict[str, str]:
        """Append a new text to the end of the memory."""
        # raises an error on length overflow, forcing the model to summarize

    def memory_update(self, text: str) -> Dict[str, str]:
        """Update the memory with new text.
           This will replace the existing memory content."""

    def memory_replace(self, old_text, new_text) -> ...
    def memory_clear(self) -> ...
    def memory_retrieve(self) -> ...  # returns the whole string

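Core design intent: this is a Claude-style "rolling summary" memory. Every N turns the model has to pull out the existing memory, compress it into a shorter summary, and write it back with memory_update. The "recursive" part is the repeated retrieve → summarize → overwrite cycle. memory_replace(old, new) is a fine-grained edit, equivalent to sed.

Difficulty: there is no index, so a Stage 2 query can only fetch the entire 10k-character string and put it into context for the LLM to grep by itself. This is why this subclass usually has the highest token cost (every retrieve pours the full memory back into the prompt). On the other hand, recall is 100% in theory (the whole string is returned), so rec_sum often scores relatively high on the leaderboard while having the worst latency and cost. This matches the RecSum / KV / Vector split mentioned in the #29 EnvTuning paper and contrasts directly with that paper's "RL-based memory management".

3.2.7.D Scoring of memory tasks

Memory tasks use the same state-based checker as multi-turn (multi_turn_checker.py), with two extra steps:

3.2.8 diag · format_sensitivity (new in V4, not part of the overall score)

The same multi_turn_base tasks are wrapped in several perturbed system-prompt templates (JSON vs YAML vs XML vs Markdown, with or without emoji, different casing, and so on) to see how stable the model's answers are across templates. The result appears in the leaderboard's FmtΔ column (format gap). A related effect shows up in §6.2, where Claude Opus 4.5 in prompt mode scores 44 pp lower overall than its FC variant: prompting-mode models are extremely sensitive to the system-prompt template, while FC handlers are barely affected.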
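The retrieve → summarize → overwrite cycle can be sketched with a toy single-string memory. Everything here is illustrative: ToyRecSumMemory, remember, and crude_summarizer are hypothetical names, the return messages are invented, and the truncating summarizer only stands in for the summary a model would write. The 10,000-character cap is the constant quoted from memory_rec_sum.py.

```python
MAX_MEMORY_ENTRY_LENGTH = 10000  # capacity quoted from memory_rec_sum.py


class ToyRecSumMemory:
    """Toy single-string memory; not the BFCL class."""

    def __init__(self):
        self.memory = ""

    def append(self, text):
        # Reject writes that would exceed the cap, so the caller must summarize.
        if len(self.memory) + len(text) > MAX_MEMORY_ENTRY_LENGTH:
            return {"error": "memory full, summarize before appending"}
        self.memory += text
        return {"status": "appended"}

    def update(self, text):
        # Overwrite the whole memory with new content.
        if len(text) > MAX_MEMORY_ENTRY_LENGTH:
            return {"error": "text exceeds memory capacity"}
        self.memory = text
        return {"status": "updated"}

    def retrieve(self):
        return {"memory": self.memory}


def remember(memory, text, summarize):
    """Append text; on overflow run one retrieve -> summarize -> overwrite cycle."""
    result = memory.append(text)
    if "error" in result:
        summary = summarize(memory.retrieve()["memory"])
        memory.update(summary)
        result = memory.append(text)
    return result


def crude_summarizer(text):
    # Placeholder for model-written summaries: keep the first 2,000 characters.
    return text[:2000]
```

Each overflow costs one full read of the memory plus one rewrite, which matches the observation above that this backend tends to be the most expensive in tokens.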


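§4 Evaluation method

BFCL supports two core evaluation modes, and from V3 onward adds a dual state-based + response-based evaluation for multi-turn.

4.1 AST evaluation (since V1, used for all single-turn categories)

The core dispatch of the AST checker (verbatim from bfcl_eval/eval_checker/ast_eval/ast_checker.py):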

def ast_checker(
    func_description,
    model_output,
    possible_answer,
    language: Language,
    test_category: str,
    model_name: str,
):
    if "parallel" in test_category:
        return parallel_function_checker_no_order(
            func_description, model_output, possible_answer, language, model_name
        )
    elif "multiple" in test_category:
        return multiple_function_checker(
            func_description, model_output, possible_answer, language, model_name
        )
    else:
        return simple_function_checker(
            func_description[0], model_output[0], possible_answer[0], language, model_name
        )

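All three sub-checkers parse the model output into a real AST (Python: ast.parse; Java/JS: their own tree-sitter-style parsers) and then match it layer by layer:

  1. Function-name match: was the right function called?
  2. Parameter-name match: were all required parameters supplied?
  3. Type match: an int cannot be passed as a str (strict on typing). There is one exception: because Python converts int → float automatically, that case is relaxed (added in CHANGELOG 2024-06-07 #407).
  4. Value match: a value counts as correct if it falls within the set of allowed values in possible_answer (multiple ground truths are allowed).
  5. String standardization (introduced 2024-04-03 #309): strips whitespace and the punctuation subset ,./-_*^ to make AST matching more robust.

The parallel categories use no-order matching (calls may appear in any order, but the count must match); multiple uses all-or-nothing ("strict all-or-nothing for multiple/parallel scenarios", blog 8 verbatim).

4.2 Executable evaluation (introduced in V1, retired wholesale on 2025-04-09 in #943)

The former exec_simple / exec_parallel / exec_multiple / exec_parallel_multiple / rest categories actually imported a function package and executed the call, then applied one of four criteria:

Note: the executable categories are retired and are not part of V4. CHANGELOG 2025-04-09 #943 verbatim: "Retire the executable categories from the leaderboard. The following categories will be excluded from the evaluation pipeline: rest, exec_simple, exec_parallel, exec_multiple, exec_parallel_multiple." The main reason for the retirement is that real APIs are too unstable (weather/news/finance APIs break almost weekly), while AST + state-based evaluation is easier to control.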
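Steps 1 to 5 above can be sketched for the Python case with the standard ast module. This is a simplified illustration and not the repository's checker: parse_call, type_ok, and standardize are hypothetical helpers, only keyword arguments and three schema types are handled, and lowercasing inside standardize is an assumption beyond what the list states.

```python
import ast
import re


def parse_call(source):
    """Parse a call such as f(a=1, b='x') into (function_name, {param: value})."""
    node = ast.parse(source, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("model output is not a function call")
    name = ast.unparse(node.func)  # keeps dotted names such as spotify.play
    # Only keyword arguments with literal values are handled in this sketch.
    args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, args


def type_ok(value, expected):
    """Strict type check with the int -> float relaxation."""
    is_bool = isinstance(value, bool)  # bool is a subclass of int in Python
    if expected == "integer":
        return isinstance(value, int) and not is_bool
    if expected == "float":
        return isinstance(value, (int, float)) and not is_bool
    if expected == "string":
        return isinstance(value, str)
    return True  # other schema types are not modelled here


def standardize(text):
    """Remove whitespace and the punctuation subset ,./-_*^ before comparing."""
    return re.sub(r"[\s,./\-_*^]", "", text).lower()


name, args = parse_call("calculate_triangle_area(base=10, height=5)")
print(name, args)
```

Parsing first and comparing structure afterwards is what lets the checker ignore argument order and formatting differences that would break a plain string comparison.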
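A minimal sketch of order-insensitive matching for parallel calls follows. The helper names are hypothetical and the accepted values for the spotify.play example are illustrative rather than copied from the repository. The pairing is greedy, which is enough for this example but could mis-pair calls when accepted value sets overlap.

```python
def call_matches(call, expected):
    """One call matches if the name agrees and every value is in the accepted list."""
    (name, args), (exp_name, exp_args) = call, expected
    if name != exp_name or set(args) != set(exp_args):
        return False
    return all(args[param] in exp_args[param] for param in args)


def parallel_no_order_match(model_calls, expected_calls):
    """Calls may come in any order, but the count must agree and all must match."""
    if len(model_calls) != len(expected_calls):
        return False
    remaining = list(expected_calls)
    for call in model_calls:
        for index, expected in enumerate(remaining):
            if call_matches(call, expected):
                del remaining[index]  # each expected call can be used only once
                break
        else:
            return False  # no unused expected call matches: all-or-nothing failure
    return True


expected = [
    ("spotify.play", {"artist": ["Taylor Swift"], "duration": [20]}),
    ("spotify.play", {"artist": ["Maroon 5"], "duration": [15]}),
]
swapped = [
    ("spotify.play", {"artist": "Maroon 5", "duration": 15}),
    ("spotify.play", {"artist": "Taylor Swift", "duration": 20}),
]
print(parallel_no_order_match(swapped, expected))
```

A single call that packs both artists into a list fails at the count check, which is the parallel-call versus list-argument distinction described in §3.2.1.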

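4.3 State-based evaluation (since V3, used for all multi-turn categories)

This is V3's biggest methodological innovation. The multi-turn suite ships a backend simulator (GorillaFileSystem / TravelAPI / MessageAPI / VehicleControlAPI / TicketAPI / TradingBot / MathAPI). Each tool call the model issues is actually executed by the backend and changes its state. At evaluation time (blog 13 verbatim):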

"State-based evaluation: Compares backend system attributes after all function calls complete each turn, capturing correctness of write/delete operations.

Response-based evaluation (subset-matched): Validates execution paths against minimal viable ground truth trajectories, ensuring read-only requests succeed while allowing reasonable alternative paths rather than requiring exact trajectory matching.

Human-labeled ground truth trajectories underpin all evaluation."
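The state-based idea in the first quoted paragraph can be sketched with a toy backend. ToyFileSystem, run_turn, and state_matches are hypothetical names for illustration, not BFCL classes. The sketch assumes that comparing instance attributes after each turn is sufficient, and it leaves out the response-based check.

```python
class ToyFileSystem:
    """Tiny stand-in for a stateful backend: a flat mapping of path -> content."""

    def __init__(self, files):
        self.files = dict(files)

    def mv(self, src, dst):
        self.files[dst] = self.files.pop(src)

    def cat(self, name):
        return self.files[name]  # read-only, leaves the state unchanged


def run_turn(backend, calls):
    """Execute one turn's calls, each given as (method_name, kwargs)."""
    for method, kwargs in calls:
        getattr(backend, method)(**kwargs)


def state_matches(initial_files, model_turns, truth_turns):
    """Replay both trajectories on separate backends and compare state per turn."""
    model_fs = ToyFileSystem(initial_files)
    truth_fs = ToyFileSystem(initial_files)
    for model_calls, truth_calls in zip(model_turns, truth_turns):
        run_turn(model_fs, model_calls)
        run_turn(truth_fs, truth_calls)
        if vars(model_fs) != vars(truth_fs):
            return False
    return True


initial = {"final_report.pdf": "budget analysis"}
truth = [[("mv", {"src": "final_report.pdf", "dst": "temp/final_report.pdf"})]]
# The model reads the file first, then performs the same move.
model = [[("cat", {"name": "final_report.pdf"}),
          ("mv", {"src": "final_report.pdf", "dst": "temp/final_report.pdf"})]]
print(state_matches(initial, model, truth))
```

Because only the resulting state is compared, an extra read-only call does not fail the turn; that tolerance for alternative paths is what the quoted text contrasts with exact trajectory matching.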

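Key design points:

4.4 Hallucination detection (present in every version)

irrelevance / live_irrelevance / multi_turn_miss_func all test the ability to refuse: when the function docs contain no suitable tool, the model should answer in natural language rather than make up a function name. Hallucination means inventing a tool call when no suitable tool exists.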
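The refusal check reduces to a simple rule, sketched below. The helper names are hypothetical, and the sketch assumes the model output has already been decoded into a list of (function_name, arguments) pairs, with an empty list meaning a plain natural-language reply.

```python
def irrelevance_correct(decoded_calls):
    """An irrelevance entry passes only when the model emits no call at all."""
    return len(decoded_calls) == 0


def hallucinated_names(decoded_calls, offered_tools):
    """Function names the model called although they were never offered."""
    offered = set(offered_tools)
    return [name for name, _ in decoded_calls if name not in offered]


def refusal_accuracy(all_decoded_calls):
    """Share of entries on which the model correctly made no call."""
    results = [irrelevance_correct(calls) for calls in all_decoded_calls]
    return sum(results) / len(results)


outputs = [
    [],                                                          # refused
    [("determine_body_mass_index", {"weight": 10, "height": 5})],  # took the bait
    [],
    [],
]
print(refusal_accuracy(outputs))
print(hallucinated_names([("sort", {})], ["grep", "mv"]))
```

A model that refuses everything maximizes this metric while scoring zero elsewhere, so the column is only meaningful next to the other segments.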


§5 Open-source status: repo / dataset / evaluation scripts / self-hosting

Component | Location | Status
Code repository | ShishirPatil/gorilla / berkeley-function-call-leaderboard/ | Main directory, Apache-2.0
PyPI package | pip install bfcl-eval | Released 2025-06-08 (#1054). Note: PyPI also hosts an unrelated bfcl package; do not install the wrong one
HF dataset | gorilla-llm/Berkeley-Function-Calling-Leaderboard | 11.7 MB, 49,556 downloads/month, Apache-2.0. Cannot be loaded with load_dataset(); the jsonl files have to be read by hand
Live leaderboard | gorilla.cs.berkeley.edu/leaderboard.html | Updated monthly; last commit 2026-04-12 (f7cf735)
Evaluation scripts | bfcl_eval/eval_checker/ast_eval/ast_checker.py + multi_turn_eval/ + executable_eval/ (legacy) |
Model handlers | bfcl_eval/model_handler/base_handler.py + one handler file per model |
License | Apache-2.0 (end of README: "All the leaderboard statistics, and data used to train the models are released under Apache 2.0") | Training data may be used commercially
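Since the dataset files are read by hand, a small reader is usually all that is needed. The sketch below assumes each data file holds one JSON object per line, as the jsonl remark above implies. The function name read_json_lines is hypothetical and the path in the usage comment is a placeholder for wherever the files were downloaded.

```python
import json
from pathlib import Path


def read_json_lines(path):
    """Read a file containing one JSON object per line into a list of dicts."""
    entries = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


# Usage (placeholder path):
# entries = read_json_lines("data/BFCL_v4_parallel.json")
# print(entries[0]["id"])
```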

5.1 Complete three-step self-hosting workflow (README verbatim)

# 1. Set up the environment
conda create -n BFCL python=3.10
conda activate BFCL
git clone https://github.com/ShishirPatil/gorilla.git
cd gorilla/berkeley-function-call-leaderboard
pip install -e .

# 2. Generate model responses (using the OSS model Qwen3-8B as an example)
bfcl generate \
  --model Qwen/Qwen3-8B-FC \
  --test-category multi_turn \
  --backend vllm \
  --num-gpus 1 \
  --gpu-memory-utilization 0.9

# 3. Evaluate
bfcl evaluate --model Qwen/Qwen3-8B-FC --test-category multi_turn

# Output: score/Qwen/Qwen3-8B-FC/multi_turn/*.json
#       score/data_overall.csv
#       score/data_multi_turn.csv

Key flags:


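§6 Full leaderboard: complete V4 standings (quoted 2026-05)

Data source (upgraded: the official CSV is now pulled directly): the previous revision used a third-party mirror with 1-3 pp of drift. This revision runs curl https://gorilla.cs.berkeley.edu/data_overall.csv (plus data_live.csv / data_non_live.csv / data_multi_turn.csv / data_agentic.csv / data_format_sensitivity.csv) and obtains the same CSV that the official leaderboard's JS reads when rendering (JS code verbatim: const csvFilePath = `./data_${datasetName}.csv`). As of the 2026-05-13 fetch there are 109 model rows, which makes this the most authoritative V4 data currently available. The full data is shown below in groups.

6.1 Reading the header: what the V4 columns mean

The official CSV has 36 columns; this article focuses on the 8 semantic dimensions below (others such as cost / latency are used in the §7 submission workflow):

Column | Meaning | Weight (new V4 formula)
Overall Acc | Official overall score (weighted average of the 5 segments below) |
Non-Live AST | AST comparison against the ground-truth call (the core of BFCL since V1) | 10%
Live | Real-world user-contributed prompts (introduced in V2) | 10%
Multi Turn | 4 sub-categories (base / miss_func / miss_param / long_context) | 30%
Web Search | New in V4: two web-search variants, base + no_snippet | (part of the 40% agentic)
Memory | New in V4: three backends, KV / Vector / Recursive Summarization | (part of the 40% agentic)
Irrelevance Detection | Ability to refuse when refusal is warranted | 10%
Format Sensitivity Max Delta | Largest score drop across 5 system-prompt variants | (diagnostic only, not scored)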

6.2 (a) Proprietary frontier models: complete V4 standings

From the 2026-05-13 snapshot of data_overall.csv: all proprietary models from Anthropic / OpenAI / Google / xAI / Cohere / Mistral / Amazon (an FmtΔ of N/A means the official run did not include this diagnostic):

Rank | Model | Org | Overall | NonLive | Live | MultiTurn | WebSrch | Memory | Irrel | FmtΔ
1 | Claude-Opus-4-5-20251101 (FC) | Anthropic | 77.47 | 88.58 | 79.79 | 68.38 | 84.50 | 73.76 | 84.72 | N/A
2 | Claude-Sonnet-4-5-20250929 (FC) | Anthropic | 73.24 | 88.65 | 81.13 | 61.37 | 81.00 | 64.95 | 86.61 | N/A
3 | Gemini-3-Pro-Preview (Prompt) | Google | 72.51 | 90.65 | 83.12 | 60.75 | 80.00 | 61.72 | 85.59 | 8.5
5 | Grok-4-1-fast-reasoning (FC) | xAI | 69.57 | 88.27 | 78.46 | 58.87 | 82.50 | 53.98 | 79.43 | N/A
6 | Claude-Haiku-4-5-20251001 (FC) | Anthropic | 68.70 | 86.50 | 78.68 | 53.62 | 83.50 | 54.41 | 85.11 | N/A
7 | Gemini-3-Pro-Preview (FC) | Google | 68.14 | 85.75 | 81.72 | 63.12 | 68.50 | 54.84 | 77.85 | N/A
8 | o3-2025-04-16 (Prompt) | OpenAI | 63.05 | 81.94 | 73.21 | 62.25 | 50.50 | 51.83 | 83.98 | 8.5
9 | Grok-4-0709 (Prompt) | xAI | 62.97 | 82.75 | 72.54 | 47.00 | 74.00 | 50.54 | 84.30 | 13.0
10 | Grok-4-0709 (FC) | xAI | 61.38 | 85.38 | 75.57 | 33.88 | 82.00 | 55.91 | 75.40 | N/A
12 | Grok-4-1-fast-non-reasoning (FC) | xAI | 58.29 | 88.13 | 77.94 | 46.75 | 75.00 | 26.24 | 74.09 | N/A
13 | Command A Reasoning (FC) | Cohere | 57.06 | 86.27 | 78.61 | 50.12 | 55.50 | 28.82 | 86.75 | N/A
15 | Gemini-2.5-Flash (FC) | Google | 56.24 | 84.96 | 74.39 | 36.25 | 59.00 | 41.29 | 93.67 | N/A
16 | GPT-5.2-2025-12-11 (FC) | OpenAI | 55.87 | 81.85 | 70.39 | 28.12 | 75.50 | 45.81 | 79.42 | N/A
17 | GPT-5-mini-2025-08-07 (FC) | OpenAI | 55.46 | 69.85 | 58.62 | 27.50 | 82.00 | 44.30 | 91.01 | N/A
20 | GPT-4.1-2025-04-14 (FC) | OpenAI | 53.96 | 82.79 | 69.95 | 38.88 | 68.00 | 23.87 | 86.52 | N/A
21 | o4-mini-2025-04-16 (FC) | OpenAI | 53.24 | 37.73 | 66.10 | 41.75 | 75.50 | 34.19 | 83.91 | N/A
24 | GPT-5-nano-2025-08-07 (FC) | OpenAI | 51.45 | 68.00 | 59.44 | 34.50 | 72.50 | 24.73 | 89.10 | N/A
26 | Gemini-2.5-Flash (Prompt) | Google | 50.90 | 88.08 | 78.16 | 16.75 | 62.00 | 38.71 | 91.09 | 9.0
27 | GPT-4.1-mini-2025-04-14 (FC) | OpenAI | 50.45 | 83.83 | 68.84 | 34.13 | 57.00 | 26.88 | 81.69 | N/A
28 | o4-mini-2025-04-16 (Prompt) | OpenAI | 50.26 | 81.29 | 70.76 | 16.62 | 71.50 | 35.27 | 87.16 | 9.5
30 | o3-2025-04-16 (FC) | OpenAI | 48.56 | 40.38 | 66.17 | 14.75 | 77.00 | 47.31 | 86.13 | N/A
35 | Command A (FC) | Cohere | 46.49 | 87.56 | 78.53 | 29.50 | 46.50 | 16.56 | 84.19 | N/A
38 | GPT-5.2-2025-12-11 (Prompt) | OpenAI | 45.27 | 78.29 | 67.14 | 43.75 | 40.50 | 3.87 | 87.26 | 13.0
46 | mistral-large-2411 (FC) | Mistral AI | 38.37 | 84.65 | 81.87 | 14.12 | 28.00 | 24.95 | 68.92 | N/A
49 | Mistral-Medium-2505 (FC) | Mistral AI | 37.56 | 67.44 | 67.95 | 10.75 | 35.00 | 23.01 | 91.95 | N/A
52 | Gemini-2.5-Flash-Lite (FC) | Google | 36.87 | 86.60 | 65.80 | 13.50 | 21.00 | 20.65 | 92.50 | N/A
57 | Claude-Opus-4-5-20251101 (Prompt) | Anthropic | 33.47 | 89.65 | 76.02 | 16.12 | 13.00 | 1.94 | 90.75 | 13.0
61 | Command R7B (FC) | Cohere | 32.07 | 80.96 | 69.06 | 8.25 | 27.00 | 5.16 | 81.65 | N/A
76 | palmyra-x-004 (FC) | Writer | 27.87 | 87.46 | 77.87 | 0.38 | 2.50 | 13.12 | 80.99 | N/A
80 | Amazon-Nova-2-Lite-v1:0 (FC) | Amazon | 27.10 | 86.96 | 80.83 | 2.12 | 5.00 | 2.37 | 82.11 | N/A
88 | Amazon-Nova-Pro-v1:0 (FC) | Amazon | 24.97 | 86.58 | 78.53 | 1.88 | 2.50 | 1.94 | 70.06 | N/A
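Key observations on the frontier models:
  1. Anthropic turns the tables with its FC handler: Claude-Opus-4-5 (FC) = 77.47 retakes first place on V4, versus Claude-Opus-4-5 (Prompt) = 33.47. The weights are the same, yet prompt mode is 44 pp lower. This is the extreme case of the "FormatΔ" effect. The previous revision's "Claude V3 25.3 → V4 ~70" analysis is now confirmed by the official CSV.
  2. Outside Prompt+Thinking, GPT-5.2 regresses sharply on multi-turn: GPT-5.2 (FC) scores 28.12 on MultiTurn, 40 pp below Claude Opus 4.5 (FC) at 68.38. This is the most counterintuitive result of the past six months: the industry expected reasoning models to be more stable on multi-turn, and the data shows the opposite.
  3. o3 (Prompt) 63.05 vs o3 (FC) 48.56: OpenAI's own FC handler drags o3 down by 14 pp, the opposite of Anthropic's FC gain. The trade-offs in the two vendors' FC API designs are visible in public data.

6.3 (b) Open-weight models ≥40B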

Rank | Model | Org | License | Overall | NonLive | Live | MultiTurn | WebSrch | Memory | Irrel
4 | GLM-4.6 (FC thinking) | Zhipu AI | MIT | 72.38 | 87.56 | 80.90 | 68.00 | 77.50 | 55.70 | 84.96
11 | Kimi-K2-Instruct (FC) | MoonshotAI | modified-MIT | 59.06 | 81.60 | 78.68 | 50.63 | 66.50 | 29.03 | 87.34
14 | DeepSeek-V3.2-Exp (Prompt+Thinking) | DeepSeek | MIT | 56.73 | 85.52 | 76.02 | 44.88 | 58.00 | 44.09 | 67.00
19 | DeepSeek-V3.2-Exp (FC) | DeepSeek | MIT | 54.12 | 34.85 | 53.66 | 37.38 | 69.50 | 54.19 | 93.18
23 | Qwen3-235B-A22B-Instruct-2507 (Prompt) | Qwen | Apache-2.0 | 52.15 | 90.33 | 78.68 | 44.62 | 50.50 | 19.35 | 78.89
31 | Qwen3-235B-A22B-Instruct-2507 (FC) | Qwen | Apache-2.0 | 47.99 | 37.40 | 68.91 | 45.38 | 54.00 | 23.87 | 81.73
50 | Llama-4-Maverick-17B-128E-FP8 (FC) | Meta | Llama 4 Community | 37.29 | 88.65 | 73.65 | 20.25 | 28.00 | 18.92 | 55.97
62 | Llama-3.3-70B-Instruct (FC) | Meta | Llama 3 Community | 31.90 | 88.02 | 76.61 | 21.50 | 10.00 | 8.17 | 53.53
72 | Llama-4-Scout-17B-16E (FC) | Meta | Llama 4 Community | 28.13 | 89.38 | 74.69 | 9.00 | 14.50 | 8.17 | 44.92
74 | CoALM-70B | UIUC + Oumi | Llama 3 Community | 27.99 | 83.44 | 67.28 | 10.62 | 0.00 | 5.81 | 85.65
108 | Llama-3.1-Nemotron-Ultra-253B-v1 (FC) | NVIDIA | NVIDIA Open | 10.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 100.00
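Note that GLM-4.6 (MIT) is currently first among open-weight models on V4; at 72.38% it trails Claude Opus 4.5's 77.47% by about 5 pp. A Z AI model nearly matching Anthropic's top BFCL number is one of the biggest open-weight stories of 2026 H1. Kimi-K2-Instruct (modified-MIT) is second at 59.06. Also note that Llama-3.1-Nemotron-Ultra-253B-v1 (FC) sits at 10% in the official CSV, with NonLive/Live/MultiTurn all at 0.00. This is not a data error: the handler is incompatible or every output is rejected, and Irrelevance shooting to 100% is the degenerate "refuse everything" solution (see §6.5).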
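The "refuse everything" pattern in the Nemotron row is easy to screen for mechanically. A minimal sketch, assuming rows are held as plain dicts keyed by the column names of the table above; the function name and both thresholds are illustrative choices, not anything defined by BFCL:

```python
def looks_like_refuse_everything(row, irrel_min=95.0, calling_max=5.0):
    """Flag a row whose Irrelevance is near-perfect while every category
    that requires actually calling a function is near zero.

    The thresholds are illustrative assumptions, not BFCL rules.
    """
    calling = [row[k] for k in ("NonLive", "Live", "MultiTurn", "WebSrch", "Memory")]
    return row["Irrel"] >= irrel_min and max(calling) <= calling_max


# Values copied from the table above.
nemotron = {"NonLive": 0.00, "Live": 0.00, "MultiTurn": 0.00,
            "WebSrch": 0.00, "Memory": 0.00, "Irrel": 100.00}
glm_46 = {"NonLive": 87.56, "Live": 80.90, "MultiTurn": 68.00,
          "WebSrch": 77.50, "Memory": 55.70, "Irrel": 84.96}

print(looks_like_refuse_everything(nemotron))  # True
print(looks_like_refuse_everything(glm_46))    # False
```

A check like this is worth running before trusting any Overall number, since a high Irrelevance score on its own says nothing about whether the model can call a function at all.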

6.4 (c) Open-weight ≤40B models: the sub-leaderboard that matters most when training ≤40B

Rank | Model | Params | License | Overall | NonLive | Live | MultiTurn | WebSrch | Memory | Irrel
25 | Nanbeige4-3B-Thinking-2511 (FC) | 3B | Apache-2.0 | 51.40 | 81.58 | 79.42 | 51.12 | 21.50 | 36.77 | 83.09
29 | Qwen3-32B (FC) | 32B dense | Apache-2.0 | 48.71 | 88.77 | 82.01 | 47.87 | 21.50 | 26.67 | 76.37
32 | Nanbeige3.5-Pro-Thinking (FC) | | Apache-2.0 | 47.68 | 38.35 | 69.95 | 40.00 | 42.00 | 45.16 | 74.20
33 | Qwen3-32B (Prompt) | 32B dense | Apache-2.0 | 46.78 | 90.27 | 82.01 | 43.25 | 26.00 | 15.70 | 82.39
36 | BitAgent-Bounty-8B | 8B | Apache-2.0 | 46.23 | 81.60 | 93.12 | 62.38 | 0.00 | 1.51 | 97.48
39 | Qwen3-8B (FC) | 8B | Apache-2.0 | 42.57 | 87.58 | 80.53 | 41.75 | 12.00 | 14.62 | 79.07
40 | ToolACE-2-8B (FC) | 8B | Apache-2.0 | 42.44 | 87.10 | 77.42 | 38.38 | 8.50 | 18.49 | 90.79
41 | Qwen3-30B-A3B-Instruct-2507 (FC) | 30B/A3B | Apache-2.0 | 41.39 | 85.77 | 77.94 | 30.00 | 22.50 | 17.63 | 79.90
43 | Qwen3-14B (FC) | 14B | Apache-2.0 | 41.03 | 84.94 | 80.01 | 34.75 | 10.00 | 19.57 | 81.94
54 | Qwen3-4B-Instruct-2507 (FC) | 4B | Apache-2.0 | 35.68 | 87.88 | 76.39 | 22.12 | 3.00 | 17.63 | 84.93
66 | Gemma-3-12b-it (Prompt) | 12B | Gemma TOU | 30.43 | 79.44 | 74.24 | 5.75 | 4.00 | 27.53 | 70.29
69 | Gemma-3-27b-it (Prompt) | 27B | Gemma TOU | 29.47 | 87.17 | 74.54 | 10.75 | 0.00 | 13.55 | 73.67
70 | Phi-4 (Prompt) | 14B | MIT | 28.79 | 69.56 | 60.70 | 3.88 | 4.50 | 24.73 | 87.55
81 | Granite-3.1-8B-Instruct (FC) | 8B | Apache-2.0 | 27.10 | 78.33 | 60.33 | 7.50 | 0.50 | 14.41 | 79.98
82 | Falcon3-10B-Instruct (FC) | 10B | Falcon-LLM | 27.01 | 85.00 | 75.43 | 6.50 | 1.50 | 27.53 | 32.09
83 | Granite-3.2-8B-Instruct (FC) | 8B | Apache-2.0 | 26.87 | 79.77 | 60.33 | 7.38 | 0.50 | 12.47 | 80.53
85 | Llama-3.1-8B-Instruct (Prompt) | 8B | Llama 3 Community | 25.83 | 84.00 | 70.76 | 11.12 | 3.00 | 10.75 | 42.70
86 | MiniCPM3-4B-FC (FC) | 4B | Apache-2.0 | 25.55 | 81.75 | 65.21 | 3.88 | 0.00 | 12.04 | 72.84
93 | Granite-20b-FunctionCalling (FC) | 20B | Apache-2.0 | 23.23 | 82.35 | 58.70 | 5.38 | 0.00 | 0.00 | 75.13
103 | Granite-4.0-350m (FC) | 0.35B | Apache-2.0 | 18.98 | 67.92 | 46.11 | 2.50 | 0.50 | 3.23 | 60.84

Top 3 ≤40B open-weight on V4 (verbatim from official CSV):

  1. Nanbeige4-3B-Thinking-2511 (FC): 51.40% · 3B Apache-2.0 · the biggest surprise as of 2026-05: a 3B model at 51.40%, with multi-turn 51.12% ahead of Qwen3-32B's 47.87%.
  2. Qwen3-32B (FC): 48.71% · 32B Apache-2.0
  3. Nanbeige3.5-Pro-Thinking (FC): 47.68% · Apache-2.0

6.5 (d) Specialized fine-tuned tool models

Rank | Model | Params | License | Overall | NonLive | Live | MultiTurn | WebSrch | Memory | Irrel
18 | xLAM-2-32b-fc-r (FC) | 32B | cc-by-nc-4.0 | 54.66 | 89.60 | 75.50 | 69.50 | 25.50 | 20.86 | 80.23
22 | Llama-xLAM-2-70b-fc-r (FC) | 70B | cc-by-nc-4.0 | 53.07 | 88.44 | 72.17 | 77.38 | 15.00 | 14.41 | 79.11
34 | Llama-xLAM-2-8b-fc-r (FC) | 8B | cc-by-nc-4.0 | 46.68 | 84.58 | 67.95 | 70.00 | 6.50 | 13.98 | 63.28
37 | Arch-Agent-32B | 32B | katanemo-research | 45.37 | 88.92 | 80.68 | 54.25 | 5.00 | 14.62 | 82.15
42 | xLAM-2-3b-fc-r (FC) | 3B | cc-by-nc-4.0 | 41.22 | 82.96 | 62.92 | 58.38 | 2.50 | 11.40 | 63.45
56 | Arch-Agent-3B | 3B | katanemo-research | 35.36 | 86.67 | 72.91 | 34.88 | 0.50 | 6.88 | 74.67
60 | Arch-Agent-1.5B | 1.5B | katanemo-research | 32.14 | 82.67 | 67.73 | 26.62 | 0.00 | 8.17 | 74.83
64 | Hammer2.1-7b (FC) | 7B | cc-by-nc-4.0 | 31.67 | 85.50 | 69.50 | 23.87 | 0.00 | 0.00 | 90.12
65 | xLAM-2-1b-fc-r (FC) | 1B | cc-by-nc-4.0 | 30.44 | 69.04 | 55.14 | 36.00 | 0.00 | 3.87 | 64.47
68 | Hammer2.1-3b (FC) | 3B | qwen-research | 29.71 | 84.96 | 70.54 | 16.50 | 0.00 | 3.01 | 86.12
75 | Hammer2.1-1.5b (FC) | 1.5B | cc-by-nc-4.0 | 27.88 | 82.98 | 69.50 | 15.62 | 0.00 | 0.00 | 79.40
84 | CoALM-8B | 8B | Llama 3 Community | 26.81 | 84.87 | 66.77 | 8.00 | 0.00 | 2.80 | 86.90
100 | Hammer2.1-0.5b (FC) | 0.5B | cc-by-nc-4.0 | 21.22 | 65.98 | 54.63 | 2.88 | 0.00 | 1.08 | 80.79
The failure modes of specialized models can be read directly off the CSV:

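6.6 ≤40B open-weight sub-leaderboard: historical comparison (carried over from the previous edition)

For readers of #27 (small-model MCP landscape): joining the V4 data above onto our existing V3 / self-reported numbers gives this cross-version sub-leaderboard: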

Model | Params | BFCL V3 (old) | BFCL V4 (official CSV) | License
Nanbeige4-3B-Thinking-2511 | 3B | | 51.40 | Apache-2.0
Qwen3-32B (FC) | 32B dense | 75.7 (V3 mirror) | 48.71 | Apache-2.0
Qwen3-30B-A3B-Instruct-2507 (FC) | 30B / 3B-active MoE | ~70 | 41.39 | Apache-2.0
TOUCAN-Qwen2.5-32B (#22) | 32B dense | 70.45 | (not listed separately on V4; dataset only) | Apache-2.0 (data+ckpt)
Qwen3-8B + EnvScaler (#23) | 8B | BFCL-MT 41.88 | | MIT
AWM-Qwen3-Thinking-8B (#18) | 8B | 53.83 → 65.94 OOD | |
Salesforce/xLAM-2-32b-fc-r | 32B | ~62 | 54.66 | cc-by-nc-4.0
ToolACE-2-8B | 8B | ~55 | 42.44 | Apache-2.0
BitAgent-Bounty-8B | 8B | ~54 | 46.23 | Apache-2.0
Phi-4 | 14B | 40.8 | 28.79 | MIT

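For the models with both numbers, V4 is roughly 7-29 pp lower than V3. This is a direct result of V4 weighting multi-turn + agentic at 70% (V3: 33%). Note that Nanbeige4-3B pulls an upset to take first place among ≤40B models on V4, and Qwen3-32B is no longer the ≤40B peak: this is the biggest reshuffle caused by V4's formula change.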
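The reweighting can be checked against the rows quoted in §6. A minimal sketch, assuming the V4 weights are Agentic 40% (taken here as the mean of WebSrch and Memory), Multi-Turn 30%, Live 10%, Non-Live 10% and Irrelevance 10%. The 40% / 70% / 20% figures come from this document; the split of the remaining shares is an inference that happens to reproduce the table values, not a quotation of BFCL's scoring code. The function name `v4_overall` is illustrative.

```python
def v4_overall(non_live, live, multi_turn, web_search, memory, irrelevance):
    """Recompute a BFCL V4 Overall score from per-category accuracies (in %).

    Assumed weights (inferred from the rows in this section):
    agentic 40% (mean of WebSrch and Memory), multi-turn 30%,
    live 10%, non-live 10%, irrelevance 10%.
    """
    agentic = (web_search + memory) / 2
    return (0.40 * agentic
            + 0.30 * multi_turn
            + 0.10 * live
            + 0.10 * non_live
            + 0.10 * irrelevance)


# Rows copied from §6: (NonLive, Live, MultiTurn, WebSrch, Memory, Irrel)
rows = {
    "GLM-4.6 (FC thinking)": (87.56, 80.90, 68.00, 77.50, 55.70, 84.96),
    "Qwen3-32B (FC)": (88.77, 82.01, 47.87, 21.50, 26.67, 76.37),
    "Llama-3.1-Nemotron-Ultra-253B-v1 (FC)": (0.0, 0.0, 0.0, 0.0, 0.0, 100.0),
}
for name, categories in rows.items():
    print(f"{name}: {v4_overall(*categories):.2f}")
```

Under these assumed weights the three rows come out at 72.38, 48.71 and 10.00, matching the Overall column. The last row also shows why an always-refuse model still collects exactly 10 points: the irrelevance share is all it can earn.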


§7 Submission process

BFCL has no web submission portal of the kind MCPMark / Atlas provide; submission is purely PR-based. From the README "Contributing" section, verbatim:

"We welcome contributions! To add a new model:
  1. Review bfcl_eval/model_handler/base_handler.py and/or bfcl_eval/model_handler/local_inference/base_oss_handler.py (if your model is hosted locally).
  2. Implement a new handler class for your model.
  3. Update bfcl_eval/constants/model_config.py.
  4. Submit a Pull Request.
For detailed steps, please see the Contributing Guide."

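Steps in practice:

  1. Fork ShishirPatil/gorilla.
  2. In bfcl_eval/model_handler/{api,local_inference}/, implement a handler class (use OpenAIHandler / AnthropicHandler / QwenHandler as references).
  3. Register the model name and handler class in bfcl_eval/constants/model_config.py.
  4. Run bfcl generate and bfcl evaluate locally to produce score/data_overall.csv.
  5. Open a PR and attach data_overall.csv + data_live.csv + data_multi_turn.csv.
  6. The Berkeley team reviews the PR, merges it, and updates the live leaderboard.
Note: BFCL does not rely on submitters' self-reported numbers. The maintainers rerun the evaluation on Berkeley's own hardware at a single commit (currently f7cf735, 2026-04-12). This is stricter than MCPMark (#21), which accepts self-reported results and merely labels them "self-reported".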
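Step 4 is the part most worth scripting, since it gets rerun for every category and every handler fix. A minimal sketch, assuming the `bfcl` CLI from the bfcl-eval package is on PATH and accepts `--model` and `--test-category` for both subcommands; the model name below is a hypothetical placeholder, and the wrapper functions are illustrative. By default it only prints the commands (dry run).

```python
import shlex
import subprocess


def bfcl_commands(model, test_category):
    """Return the generate and evaluate argv lists for one model and category."""
    common = ["--model", model, "--test-category", test_category]
    return [["bfcl", "generate", *common], ["bfcl", "evaluate", *common]]


def run_local_eval(model, test_category, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in bfcl_commands(model, test_category):
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline if generation or scoring fails
            subprocess.run(argv, check=True)


# "my-org/my-model-FC" is a hypothetical name; it must match the entry
# registered in bfcl_eval/constants/model_config.py (step 3).
run_local_eval("my-org/my-model-FC", "multi_turn")
```

Keeping generation and evaluation as two separate commands mirrors the CLI itself: generation is the expensive step, and scoring can be repeated on cached outputs.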

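§8 Limitations and the fundamental difference from MCP benchmarks

8.1 Limitations of BFCL itself

  1. Schema-based, not stateful real-MCP: BFCL multi-turn uses 7 simplified Python class backends (GorillaFileSystem / TravelAPI / MessageAPI / VehicleControlAPI / TicketAPI / TradingBot / MathAPI), not real OAuth-authenticated MCP servers. So a high BFCL score does not guarantee that a model works against the real Notion / GitHub / Stripe APIs.
  2. Web Search is tied to SerpAPI: to run V4's web_search category you need a SerpAPI key, and the numbers drift as internet content changes. This is the only non-deterministic part of the V4 evaluation.
  3. Live category quality depends on user contributions: V2 Live data comes from community PRs and is uneven in quality (the CHANGELOG repeatedly fixes individual ground-truth bugs: #600 #1185 #1175 #1086 …).
  4. Format Sensitivity exposes how brittle scores are: V4 introduced this category as an acknowledgment that whether many models score 60% or 70% on BFCL depends entirely on how the system prompt is written.
  5. No truly long horizon: V3 multi-turn is mostly 1-3 turns, and V4 web/memory rarely exceeds 10 turns. Compared with Toolathlon's (#21) multi-hop tasks over 50+ tools, BFCL agentic tasks read more like short vignettes.
  6. Limited language and ecosystem coverage: Java (100) and JavaScript (50) have too few cases, and REST (70) was retired in V4. In practice BFCL is a Python-only function-calling benchmark.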
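Point 1 is easier to reason about with a concrete picture of what a state-based check on an in-memory backend looks like. A minimal sketch: `ToyFileSystem`, `replay` and `state_matches` are hypothetical stand-ins written for this note, not BFCL's GorillaFileSystem or its multi-turn checker. The idea is that the model's trajectory and the reference trajectory are each replayed on a fresh backend and the final states are compared.

```python
class ToyFileSystem:
    """Hypothetical stand-in for a BFCL-style in-memory class backend."""

    def __init__(self):
        self.files = {}

    def touch(self, name):
        self.files.setdefault(name, "")

    def write(self, name, text):
        if name not in self.files:
            raise FileNotFoundError(name)
        self.files[name] = text

    def rm(self, name):
        del self.files[name]


def replay(calls):
    """Run (method, kwargs) calls on a fresh backend and return its final state."""
    backend = ToyFileSystem()
    for method, kwargs in calls:
        try:
            getattr(backend, method)(**kwargs)
        except Exception:
            pass  # a failed call leaves the state unchanged
    return backend.files


def state_matches(model_calls, reference_calls):
    """State diff: compare final states, not the call sequences themselves."""
    return replay(model_calls) == replay(reference_calls)


reference = [("touch", {"name": "a.txt"}),
             ("write", {"name": "a.txt", "text": "hi"})]
detour = [("touch", {"name": "b.txt"}), ("touch", {"name": "a.txt"}),
          ("rm", {"name": "b.txt"}), ("write", {"name": "a.txt", "text": "hi"})]
broken = [("write", {"name": "a.txt", "text": "hi"})]  # writes before creating

print(state_matches(detour, reference))  # True
print(state_matches(broken, reference))  # False
```

Everything here lives in one Python process with no auth, no network and no partial failures, which is exactly why a pass on this kind of check says little about behavior against a real OAuth-backed service.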

8.2 Why "high BFCL score ≠ high MCP benchmark score"

This is a claim that #21 and #26 have verified repeatedly. The core differences, distilled into one table:

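Dimension | BFCL | MCP-Atlas / Universe / Toolathlon / MCPMark
Tool interface | Python class API (pure schema) | Real MCP protocol (JSON-RPC + OAuth)
State | Simple in-memory backend | Real stateful services (GCP / Notion / Stripe / Slack / Filesystem)
Evaluator | AST + state diff (no LLM judge) | Atlas: Gemini-2.5-Pro claim judge · Universe: three layers (format + static + dynamic) · MCPMark: real assertions in verify.py · Toolathlon: real side effects
Turns | 1-10 turns | 10-50+ turns
Failure modes | Wrong call / wrong arguments / failure to abstain | OAuth token expiry / API rate limits / half-completed state / side-effect leakage
Cost (one full eval run) | ~$50 API + a few GPU hours | $500-2000 API + real cloud env
What a "high score" means | Holds the schema and keeps state straight | Can complete tasks on a real production stack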

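So GPT-5 scores only 59.2 on BFCL V4 while its MCP-Atlas numbers are more scattered (GPT-5.4 scores 68.1% on MCP-Atlas), and Claude Opus 4 scores 25.3 on BFCL V3 but 77.3 on MCP-Atlas Live: the two kinds of benchmark measure fundamentally different dimensions. From the conclusion of #26:

"No single ckpt can be SOTA on every benchmark, but it can reach 80% on each. Joint training recipe: TOUCAN cold start (BFCL) + MCPMark/Universe RLVR (strict schema verification) + Toolathlon long-horizon SFT + Atlas LoRA isolation."
#26 conclusion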

§9 Official citations by frontier models

This is the test that #19 and #21 already established: whether a benchmark is a "de-facto standard" depends on whether frontier model cards cite it directly.

Frontier model | Cites BFCL in official card? | How it is cited
Claude Opus 4.7 (Anthropic, 2026-04-16) | Not cited directly in the official card | Anthropic leads with MCP-Atlas (77.3% Live); the BFCL handler is community-maintained (CHANGELOG #1019 adds claude-opus-4-1)
GPT-5 (OpenAI, 2025-08-07) | Yes: built-in V4 handler | OpenAI makes no standalone BFCL claim, but GPT-5 / 5-mini / 5-nano were explicitly added as "New model support" in V4 #1019
GPT-5.5 (OpenAI, early 2026) | Yes: cited verbatim in the system card | GPT-5.5 system card: "CoT-Control includes over 13,000 tasks built from established benchmarks including BFCL (Patil et al., 2025)"; BFCL is treated as a CoT training source
Gemini 3.1 Pro (Google DeepMind, 2025-12) | Not cited explicitly in the model card | The official card stresses "meaningful improved tool use" but gives no BFCL numbers. A third party (awesomeagents.ai) reports Gemini 3 Flash Preview Thinking at 53.5% on V3
Kimi K2.5 (Moonshot) | Yes: the official paper reports BFCL | The K2.5 paper / blog treats BFCL as a mainline eval; #models/06
Qwen3 series (Alibaba) | Yes: self-reported V3 + V4 | The Qwen team blog lists BFCL directly in its release notes; each Qwen3.5 size reports its own BFCL V4 number (llm-stats.com)
GLM-4.5 / 4.7 (Z AI) | Yes | Current champion on both V3 and V4; always cited in Z AI's official papers
INTELLECT-3 (Prime Intellect) | Yes | #models/08: BFCL V3 63.5

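Conclusion: GPT-5.5 is the only Western frontier model that writes BFCL into its system card (and as a training data source rather than an eval target). Anthropic and Google lean toward MCP-Atlas and their own agentic benchmarks. But Chinese labs (Qwen / GLM / Kimi) and training-oriented small models (xLAM / Hammer / TOUCAN / EnvScaler / AWM) cite BFCL almost 100% of the time. So BFCL is close to a mandatory KPI in the "open-weight + ≤40B function-calling" space, and a second choice in the frontier proprietary space.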


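§10 My take and practical advice

10.1 My take

  1. BFCL is a schema-level standard, not a production agent standard. It evaluates whether a model, given a function spec, can make the right call. It does not evaluate completing a task on a real OAuth-backed Notion. In the capability tree the two are parent and child: BFCL is the prerequisite, and MCP benchmarks are what establish production readiness. Any ≤40B model aiming at function calling should first reach ≥70% on BFCL, then move on to MCP-Atlas / Toolathlon.
  2. V4's reweighting is aggressive: Agentic goes straight to 40%. V4 cuts the combined share of single-turn (Live + Non-Live) from 66% to 20%, which means the strategy that used to buy high V3 scores (TOUCAN 1.5M SFT) is worth less on V4. A high V4 score needs (a) solid multi-turn state tracking, (b) sequential decision-making over search/memory API calls, and (c) the ability to abstain or ask follow-up questions. This is exactly the sweet spot for synthetic-env training of the EnvScaler + SETA + AWM kind.
  3. Current ≤40B ceiling: Qwen3-32B (V3 75.7%). But V4 SOTA has already been taken by GLM-4.5 (70.9%); note that it is not in the ≤40B class. Over the coming year (2026 H2), the open-source race will most likely be Qwen3.5 / GLM-4.7-Air / DeepSeek-V4-FC competing on V4.
  4. Claude losing out on BFCL ≠ Claude being bad at tool use. The huge gap between V3 BFCL 25.3 and MCP-Atlas 77.3 comes down to the Anthropic Tool Use API schema being incompatible with BFCL's default prompt-mode schema. This is the real reason V4 added the format_sensitivity diagnostic. If you use Claude as a tool agent, remember to use the FC-native handler claude-opus-4-1-20250805 (CHANGELOG #1019).
  5. What BFCL lacks is what MCP benchmarks are filling in. BFCL does not cover: OAuth / real API rate limits / side effects / long horizons (>10 turns) / cross-server orchestration / ambiguity in natural-language queries. These are exactly the lanes that MCP-Universe (#25) / MCPMark / Toolathlon / Atlas each take. BFCL's blank space is the MCP benchmarks' opportunity.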

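10.2 How to use BFCL, and how not to, when training ≤40B models

✅ How to use it

❌ How not to use it

10.3 One-sentence recommendation

The standard pipeline for training a ≤40B function-calling model: TOUCAN 1.5M SFT cold start → BFCL V3 Overall ≥70% gate → EnvScaler/SETA synthetic-env RLVR → BFCL V4 Agentic ≥50% gate → real testing on MCP-Atlas / MCPMark. BFCL is the second gate, not the finish line, but you cannot continue without clearing it.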

§11 BFCL-targeted RL / post-train infra

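This chapter was added in 2026-05: since BFCL is the de-facto standard KPI for ≤40B function calling (§9), the question "which frameworks / datasets / recipes were trained directly with BFCL as the optimization target?" becomes directly actionable. Below are all the verifiable artifacts in the current stack, each with repo URL / license / supporting evidence.

11.1 The Gorilla team's own training code: absent

This is a negative finding of the research. The Berkeley Gorilla team has not released public RL code for "training on BFCL". Specific observations:

Conclusion: to "train against BFCL" you have to use third-party frameworks. Below is the stack that is actually usable today.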

11.2 NVIDIA Tool-N1: the only open-source RL training codebase that explicitly targets BFCL

Field | Value
Repo | github.com/NVlabs/Tool-N1
Paper | arXiv:2505.00024 "Nemotron-Research-Tool-N1: Tool-Using LLMs with Reinforced Reasoning" (Zhang et al. 2025)
License | Apache-2.0
Underlying RL framework | verl (ByteDance HybridFlow, GRPO algorithm)
SFT framework | LLaMA-Factory + LoRA
Reward design | R1-style binary reward: structural validity ∧ functional correctness (no CoT annotations needed)
Eval target | BFCL + APIBank + ACEBench (the repo ships eval/ + BFCL handlers)
Deliverables | preprocessed RL+SFT datasets · training scripts · eval scripts · model ckpts

Core claim (paper, verbatim): "RL offers a more effective paradigm for enhancing the tool-calling capabilities of LLMs compared to standard supervised fine-tuning", with results on BFCL showing RL beating SFT. This is the first BFCL-native end-to-end RL codebase with empirical results.
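The "structural validity ∧ functional correctness" reward can be illustrated in a few lines. A minimal sketch using only the standard-library `ast` module: it is a stand-in written for this note, not Tool-N1's reward code and not the bfcl_eval AST checker, and it assumes calls are written as Python-style keyword-only calls such as `get_weather(city="Paris")`. The helper names and the example function are hypothetical.

```python
import ast


def parse_call(text):
    """Parse 'func(a=1, b="x")' into (name, kwargs); return None if malformed."""
    try:
        node = ast.parse(text.strip(), mode="eval").body
    except (SyntaxError, ValueError):
        return None
    # Structural validity: a single call with keyword arguments only.
    if not isinstance(node, ast.Call) or node.args:
        return None
    if any(kw.arg is None for kw in node.keywords):  # reject **splat
        return None
    try:
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except (ValueError, TypeError, SyntaxError):
        return None  # argument values must be plain literals
    return ast.unparse(node.func), kwargs


def binary_reward(model_output, reference_call):
    """Return 1.0 only if the output parses AND matches the reference call."""
    predicted = parse_call(model_output)
    expected = parse_call(reference_call)
    if predicted is None or expected is None:
        return 0.0  # structural validity failed
    return 1.0 if predicted == expected else 0.0  # functional correctness


reference = 'get_weather(city="Paris", unit="C")'
print(binary_reward('get_weather(unit="C", city="Paris")', reference))  # 1.0
print(binary_reward('get_weather(city="Rome", unit="C")', reference))   # 0.0
print(binary_reward('get_weather(city=', reference))                    # 0.0
```

Comparing parsed keyword dicts rather than raw strings makes the reward insensitive to argument order and whitespace, which is the property that lets a checker like this replace an LLM judge.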

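Top recommendation (my top pick from this research): Tool-N1 is currently the only artifact that (1) explicitly uses BFCL as its evaluation target, (2) is fully open source (dataset + scripts + ckpt + Apache-2.0), (3) builds on a mature RL stack (verl + GRPO), and (4) has a published paper validating its effectiveness. It is more complete than both xLAM (cc-by-nc-4.0, SFT-leaning) and Hammer (no RL).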

11.3 Salesforce xLAM + APIGen + ActionStudio: SFT-heavy, largest data scale

Field | Value
Repo | github.com/SalesforceAIResearch/xLAM (619 stars, Apache-2.0 license)
Key subdirectories | actionstudio/ (EMNLP 2025 main paper, training) + xLAM/ (inference)
Dataset 1 | xlam-function-calling-60k: synthesized with APIGen, HF top-3 trending in 2024-07
Dataset 2 | APIGen-MT-5k: multi-turn function calling, released 2025-05
Paper | arXiv:2504.03601 (APIGen-MT, ICLR 2026 accepted)
Ckpt licenses | cc-by-nc-4.0 (research only): the key restriction
BFCL on V4 (official CSV) | xLAM-2-32b-fc-r 54.66 · 70b 53.07 · 8b 46.68 · 3b 41.22 · 1b 30.44
RL or SFT | Mainly SFT (actionstudio is a unified trajectory training pipeline, not an RL framework)

Best path for training toward BFCL with xLAM, in three stages: (a) cold-start SFT on xlam-function-calling-60k, (b) multi-turn SFT on APIGen-MT-5k, (c) GRPO RL with Tool-N1 (§11.2). This has the strongest current evidence in the BFCL multi-turn category (xLAM-2 reaches 77.38% on the multi-turn dimension alone, §6.5).

11.4 MadeAgents Hammer: function-masking SFT, on-device oriented

Field | Value
Repo | github.com/MadeAgents/Hammer (Apache-2.0)
Paper | arXiv:2410.04587 "Hammer: Robust Function-Calling via Function Masking" (ICLR 2025 Spotlight)
Training subdirectory | train/ contains only data_processing.py; the full trainer is not published, so you have to wire up HF trainer yourself
Ckpt licenses | cc-by-nc-4.0 (0.5B/1.5B/7B) · qwen-research (3B)
BFCL on V4 | 7B: 31.67 · 3B: 29.71 · 1.5B: 27.88 · 0.5B: 21.22 (WebSrch/Memory at or near 0 for all sizes)
Positioning | On-device function calling; after V4 added the agentic categories, this model family is hit hardest (WebSrch/Memory almost all 0)

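11.5 verl / SkyRL / NeMo-RL: general-purpose RL frameworks (can host a BFCL reward)

Framework | Repo | License | Direct BFCL support? | Notes
verl | verl-project/verl | Apache-2.0 | No built-in BFCL recipe; Tool-N1 uses it as its backbone | ByteDance HybridFlow; supports GRPO/PPO/DPO. Added zero-mismatch HF rollout in 2026-05. You have to write the BFCL reward yourself, ~100 lines (using the bfcl_eval AST checker as the reward fn is enough).
SkyRL | NovaSky-AI/SkyRL | Apache-2.0 | No BFCL recipe | UCB NovaSky, modular full-stack. The SkyRL-Agent subproject (arXiv:2511.16108) focuses on long-horizon multi-turn, so adapting it to BFCL multi_turn should work in principle.
NVIDIA NeMo-RL | NVIDIA-NeMo/RL | Apache-2.0 | No BFCL recipe | Part of NeMo, focused on DeepScaleR-style verifiable rewards. NeMo-Gym can serve as the env container. The NVlabs release of Tool-N1 uses verl rather than NeMo-RL, an interesting signal about NVIDIA's internal org lines.
verl-tool | TIGER-AI-Lab/verl-tool | Apache-2.0 | Supports generic tools; no BFCL-specific recipe | Tool-use fork of verl, 981 stars. Wraps tool calls as an RL env, so BFCL-style tasks can be plugged in directly.
OpenPipe ART | OpenPipe/ART | Apache-2.0 (9.4k stars) | No BFCL example | The closest thing is examples/mcp-rl/ (MCP server tool use) + LangGraph integration. Full GRPO support; a BFCL adaptation is not hard, but there is no ready-to-run notebook.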

11.6 BFCL-targeted RL papers(2025-2026)

Paper | Core idea | Uses BFCL? | Open source?
Tool-N1 (arXiv:2505.00024, NVIDIA) | R1-style binary reward + GRPO + verl, replacing SFT | Main eval target | NVlabs/Tool-N1 Apache-2.0
"Reasoning through Exploration: RL for Robust Function Calling" (arXiv:2508.05118) | EGPO: entropy-guided GRPO, with CoT entropy entering the advantage | BFCL eval | Repo not stated
RC-GRPO (arXiv:2602.03025) | Reward-Conditioned GRPO, reward token-conditioned exploration | Multi-turn tool calling (incl. BFCL) | ?
R2IF (arXiv:2604.20316) | Composite reward (format + CER + SMV) + GRPO, emphasizing interpretable function calling | BFCL eval | ?
TOUCAN (arXiv:2510.01179) | SFT on 1.5M synthetic MCP trajectories (not RL) | BFCL V3 eval | ✅ CC-BY 4.0 dataset
APIGen-MT (arXiv:2504.03601) | Two-agent simulation to generate multi-turn data; training source for xLAM-2-fc-r | BFCL + τ-bench eval | ✅ APIGen-MT-5k dataset, ckpt cc-by-nc
GPT-5.5 system card "CoT-Control" | 13K-task CoT-controllability training set; BFCL is one of the source benchmarks | Training source | Eval set: YuehHanChen/CoTControl

GPT-5.5 system card verbatim: "CoT-Control includes over 13,000 tasks built from established benchmarks: GPQA, MMLU-Pro, HLE, BFCL and SWE-Bench Verified. Each task is created by pairing a benchmark problem with one CoT instruction such as avoiding certain problem-relevant keywords in CoT, using only lowercase letters, or appending a given word to each sentence." This moves BFCL from "eval" to "training data source", which is the most serious way a frontier lab can cite BFCL.

11.7 Comparison matrix: Stage × Tool × License × BFCL-specific?

Tool / Stack | SFT | RL | Distill | License | BFCL-targeted? | Recommended use
NVlabs/Tool-N1 | ✅ LLaMA-Factory | ✅ verl+GRPO | | Apache-2.0 | ✅ explicit | BFCL-native RL, first choice
SalesforceAIResearch/xLAM | ✅ ActionStudio | | | code Apache-2.0 / ckpt cc-by-nc | ✅ indirect (self-reported BFCL top) | multi-turn SFT cold start
MadeAgents/Hammer | ✅ (partial) | | | Apache-2.0 (code) | ✅ indirect | on-device tiny model
verl-project/verl | | | | Apache-2.0 | ❌ (general-purpose) | BFCL RL backbone
NovaSky-AI/SkyRL | | | | Apache-2.0 | ❌ (general-purpose) | multi-turn long-horizon
OpenPipe/ART | | ✅ GRPO | | Apache-2.0 | ❌ (MCP-RL is close) | fast iteration
TIGER-AI-Lab/verl-tool | | | | Apache-2.0 | | tool env wrap
NVIDIA-NeMo/RL | | | | Apache-2.0 | | large-scale cluster
TOUCAN dataset | ✅ (1.5M traj) | | | CC-BY 4.0 | ✅ eval | SFT cold-start data
APIGen-MT dataset | ✅ (5k MT) | | | cc-by-nc-4.0 | ✅ eval | multi-turn reinforcement

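11.8 My take: a practical BFCL training stack

  1. Start with pure SFT: Qwen3-8B (Apache-2.0 base) + TOUCAN-1.5M (cold start, 4 epochs) + APIGen-MT-5k (multi-turn, 2 epochs). Target: V3 ≥70% → V4 NonLive ≥85% / Live ≥75%. The whole thing runs on HF trainer + LoRA on 4×A100 in about a week.
  2. RL uplift: use Tool-N1's verl + GRPO, with the AST checker under BFCL's bfcl_eval/eval_checker/ast_eval/ as the reward (import it directly; this is the most important practical point in §11), plus a multi_turn state diff. Keep the train split and the eval split strictly separated: BFCL has no official train/val split, so you have to build your own hold-out.
  3. Pitfalls to avoid:
    • Do not train on BFCL's original ground truth: that is the eval set, so doing so is contamination.
    • Do not use xLAM ckpts commercially (cc-by-nc); for commercial use you must retrain from a base model + dataset.
    • Do not chase BFCL Overall alone: break out multi_turn vs irrelevance, since one often rises as the other falls (xLAM is the typical case).
  4. Next step (treating BFCL as a prerequisite): once BFCL V4 Agentic is ≥50%, move on to MCP-Atlas / MCPMark (#21); BFCL cannot replace real MCP eval (§8.2).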
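The hold-out in point 2 is worth making deterministic, so that a task never drifts between splits across runs or machines. A minimal sketch, assuming every task in your own training pool has a stable string id; the salt string, the 20% fraction and the function names are arbitrary illustrative choices, and the ids in the example are hypothetical.

```python
import hashlib


def split_of(task_id, holdout_fraction=0.2, salt="bfcl-holdout-v1"):
    """Deterministically assign a task id to 'train' or 'holdout'."""
    digest = hashlib.sha256(f"{salt}:{task_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_fraction else "train"


def make_splits(task_ids, holdout_fraction=0.2):
    """Partition task ids into disjoint train and holdout lists."""
    splits = {"train": [], "holdout": []}
    for task_id in task_ids:
        splits[split_of(task_id, holdout_fraction)].append(task_id)
    return splits


# Hypothetical ids; in practice use whatever stable id your task pool carries.
task_ids = [f"task_{i}" for i in range(1000)]
splits = make_splits(task_ids)
print(len(splits["train"]), len(splits["holdout"]))
```

Hashing the id rather than shuffling with a random seed means that adding new tasks later never moves an existing task from one split to the other, which keeps the hold-out clean over the life of a training run.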

Related reading