BizFinBench.v2:

A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment

Xin Guo1,2,*, Rongjunchen Zhang1,*,♠, Guilong Lu1, Xuntao Guo1, Jia Shuai1, Zhi Yang2, Liwen Zhang2,♠
1HiThink Research, 2Shanghai University of Finance and Economics
*Co-first authors, ♠Corresponding authors
zhangrongjunchen@myhexin.com & zhang.liwen@shufe.edu.cn

Abstract

Large language models have evolved rapidly and are becoming a pivotal technology for intelligent financial operations. However, existing benchmarks often rely on simulated or general-purpose samples and focus on single, offline, static scenarios. Consequently, they fail to meet the authenticity and real-time responsiveness that financial services require, leaving a significant gap between benchmark performance and actual operational efficacy. To address this, we introduce BizFinBench.v2, the first large-scale evaluation benchmark that is grounded in authentic business data from both Chinese and U.S. equity markets and that integrates online assessment. We performed clustering analysis on authentic user queries from financial platforms, resulting in eight fundamental tasks and two online tasks across four core business scenarios, totaling 29,578 expert-level Q&A pairs. Experimental results show that ChatGPT-5 leads on the main tasks with 61.5% accuracy, though a substantial gap relative to financial experts persists; in online tasks, DeepSeek-R1 outperforms all other commercial LLMs. Error analysis further identifies the specific capability deficiencies of existing models in practical financial business contexts. BizFinBench.v2 overcomes the limitations of current benchmarks, deconstructing LLM financial capabilities at the business level and providing a precise basis for evaluating the efficacy of LLMs deployed widely in the financial domain.

Overview


• Overall Model Performance

The complete model ranking, by average score, is shown below.

[Figure: overall model ranking by average score]

• Statistics

[Figures: benchmark statistics]

Results on BizFinBench.v2


Overall, ChatGPT-5 ranks first among all evaluated models with an average accuracy of 61.5%, highlighting its broad competitive advantage in financial scenarios. Proprietary models such as Gemini-3 and Doubao-Seed-1.6 also perform strongly, ranking within the top three on multiple tasks. Among open-source models, Qwen3-235B-A22B-Thinking-2507 performs best, with an average accuracy of 53.3%. In contrast, the leading finance-specific model, Dianjin-R1, achieves an average accuracy of only 35.7%, trailing Qwen3-32B by 5.6 percentage points. We attribute this gap to two factors. First, the training data for finance-specific models comes primarily from open-source financial datasets centered on financial knowledge and simulated scenarios, which fail to capture the complex characteristics of real-world financial business environments. Second, although Dianjin-R1 incorporates customer service Q&A data, its business coverage remains narrow, making it difficult to adapt to more volatile, long-context practical business scenarios.

[Figure: results on BizFinBench.v2]
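
The gaps quoted in the results paragraph above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the benchmark's tooling: it uses only the average accuracies stated in the text, and it assumes the 5.6% margin between Dianjin-R1 and Qwen3-32B is meant in percentage points, so the Qwen3-32B value is implied by the text rather than reported directly. The helper names `gap_pp` and `relative_gap` are hypothetical.

```python
# Average accuracies (%) as stated in the results paragraph.
reported = {
    "ChatGPT-5": 61.5,
    "Qwen3-235B-A22B-Thinking-2507": 53.3,
    "Dianjin-R1": 35.7,
}

# Assumption: the 5.6% margin is in percentage points, so the
# Qwen3-32B average is implied by the text, not reported directly.
MARGIN_PP = 5.6
qwen3_32b_implied = round(reported["Dianjin-R1"] + MARGIN_PP, 1)


def gap_pp(a: float, b: float) -> float:
    """Absolute gap between two accuracies, in percentage points."""
    return round(a - b, 1)


def relative_gap(a: float, b: float) -> float:
    """Gap of a over b, expressed as a percentage of b."""
    return round((a - b) / b * 100, 1)


# Best proprietary model vs. best open-source model.
top_vs_open = gap_pp(reported["ChatGPT-5"],
                     reported["Qwen3-235B-A22B-Thinking-2507"])

# Best open-source model vs. leading finance-specific model.
open_vs_fin = gap_pp(reported["Qwen3-235B-A22B-Thinking-2507"],
                     reported["Dianjin-R1"])

# The same 5.6-point margin read as a relative difference.
rel_margin = relative_gap(qwen3_32b_implied, reported["Dianjin-R1"])

print(f"ChatGPT-5 vs. best open-source: {top_vs_open} pp")
print(f"Best open-source vs. Dianjin-R1: {open_vs_fin} pp")
print(f"Implied Qwen3-32B average: {qwen3_32b_implied}%")
print(f"Qwen3-32B over Dianjin-R1, relative: {rel_margin}%")
```

Under that reading, a 5.6-point margin corresponds to a relative difference of roughly 15.7% over Dianjin-R1's score, which is why stating the unit matters when comparing models.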

Online Evaluation


[Figure: online evaluation results]

Case Study


• Counterfactual Inference

[Figure: counterfactual inference case study]

Citation


Coming soon