A new benchmark for AI agents offers a sharp reality check for investment banking automation. The test, called BankerToolBench, asked leading models to complete the kinds of deliverables junior bankers prepare every day: Excel models, PowerPoint decks, PDF reports, and Word memos.
The result was not a close call. Around 500 current and former investment bankers reviewed the outputs, and none were considered ready to send directly to a client.
What BankerToolBench Tested
BankerToolBench was released by a research team at Handshake AI and McGill University. Handshake AI is the business arm of the career platform Handshake, which places vetted academics and professionals inside AI labs to help train and evaluate models.
The benchmark was built around 100 tasks intended to mirror junior investment banking work. The team enlisted around 500 current and former investment bankers from firms including Goldman Sachs, JPMorgan, Evercore, Morgan Stanley, and Lazard. Of those, 172 designed the tasks themselves, contributing more than 5,700 hours of work.
Each task took a human banker an average of five hours, with some running up to 21 hours. That matters because these were not short question-and-answer prompts. The models had to produce actual work products that could be checked against professional expectations.
The benchmark covered deliverables such as:
- Excel financial models with working formulas
- PowerPoint decks for client meetings
- PDF reports
- Word memos
The AI agents also had to search through data rooms, use market data platforms including FactSet and Capital IQ, and parse SEC filings. According to the paper, one task can involve up to 539 language model calls, with 97 percent tied to tool use or code execution.
How The Outputs Were Judged
The researchers did not grade the work with a simple pass-or-fail checklist. Each deliverable was reviewed against a banker-designed rubric averaging 150 individual criteria. Those criteria covered six areas, including technical correctness, client readiness, compliance, auditability, and consistency across files.
Grading was handled by Gandalf, an AI verifier built by the authors and based on Gemini 3 Flash Preview. The verifier agreed with human reviewers 88.2 percent of the time. That was slightly higher than the 84.6 percent agreement rate between two human reviewers.
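To make the agreement figures concrete, here is a minimal sketch of how a simple agreement rate between two graders could be computed. It assumes agreement means the share of criteria on which both graders gave the same pass-or-fail verdict; the article does not specify the paper's exact metric. The function name and the sample verdicts are hypothetical.

```python
def percent_agreement(labels_a, labels_b):
    """Share of criteria (in percent) on which two graders gave the same verdict."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both graders must score the same criteria")
    if not labels_a:
        raise ValueError("need at least one criterion")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Hypothetical verdicts on eight rubric criteria (True = criterion met).
human = [True, True, False, True, False, False, True, True]
verifier = [True, True, False, True, True, False, True, True]

# The graders disagree on one of eight criteria.
agreement = percent_agreement(human, verifier)
print(f"{agreement:.1f} percent agreement")
```

Under this reading, the 84.6 percent human-to-human figure serves as a rough ceiling: an automated grader that matches a human about as often as a second human would is arguably as reliable as the review process it stands in for.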
This structure makes the benchmark unusually demanding. It asks not only whether a model can write plausible text or assemble a clean-looking slide, but also whether numbers line up, formulas work, the output follows style requirements, and the underlying work can be audited.
GPT-5.4 Led, But Still Fell Short
The team tested GPT-5.2, GPT-5.4, Claude Opus 4.5 and 4.6, Gemini 2.5 Pro, Gemini 3.1 Pro Preview, Grok 4, and the open-source models Qwen-3.5-397B and GLM-5.
GPT-5.4 ranked first, but the lead did not translate into client-ready performance. It still failed nearly half the criteria. Only 16 percent of its outputs met the threshold where bankers would accept them as a useful starting point. When the benchmark required three consistent runs, that fell to 13 percent.
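The drop from a single-run figure to a three-run figure is what one would expect when a task only counts if every repeated attempt succeeds. The sketch below illustrates that idea with made-up results; how the paper aggregates runs is an assumption here, and the task names and outcomes are hypothetical.

```python
def single_run_rate(results):
    """Fraction of all individual runs that met the threshold."""
    runs = [outcome for task_runs in results.values() for outcome in task_runs]
    return sum(runs) / len(runs)


def consistent_rate(results):
    """Fraction of tasks where every run met the threshold."""
    return sum(all(task_runs) for task_runs in results.values()) / len(results)


# Hypothetical outcomes: three runs per task, True = met the threshold.
results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
    "task_d": [False, True, False],
}

per_run = single_run_rate(results)      # 6 of 12 runs succeed
consistent = consistent_rate(results)   # only task_a succeeds every time
print(per_run, consistent)
```

The stricter measure can never exceed the per-run rate, so the gap between the two is a rough indicator of how much of a model's success depends on luck in a given run.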
The hardest finding is the simplest one: no output from any model was ready to submit as is. With GPT-5.4, just 2 percent of tasks passed every critically weighted criterion. With Gemini 2.5 Pro, the figure was zero.
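The gap between failing "nearly half the criteria" and passing every critical one reflects two different ways of reading the same rubric. A minimal sketch, assuming each criterion carries a pass-or-fail verdict and a flag marking it as critical; the class, field names, and sample criteria are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    passed: bool
    critical: bool


def share_passed(criteria):
    """Fraction of all criteria that were met."""
    return sum(c.passed for c in criteria) / len(criteria)


def passes_all_critical(criteria):
    """True only if no critical criterion failed."""
    return all(c.passed for c in criteria if c.critical)


# Hypothetical rubric excerpt for one deliverable.
rubric = [
    Criterion("balance sheet balances", True, True),
    Criterion("key outputs driven by formulas", False, True),
    Criterion("sources cited on every slide", True, False),
    Criterion("style guide colors used", True, False),
]

score = share_passed(rubric)           # three of four criteria met
gate = passes_all_critical(rubric)     # one critical failure blocks the task
print(score, gate)
```

A deliverable can score well on average and still be blocked by a single critical miss, which is why the all-critical figure sits so far below the average pass rate.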
Still, the benchmark does not suggest the work was useless. More than half of the bankers said they would use the output as a starting point. That distinction is important. The current value appears to be draft assistance, not autonomous production of materials that can go straight to a client.
Where The Models Broke Down
The failures were not always obvious on the surface. Claude Opus 4.6 produced outputs that researchers said looked polished at first glance. But the Excel models exposed a core problem: many key numbers were hardcoded as fixed values instead of calculated with formulas.
For investment banking, that is a serious flaw. If a model cannot update when an assumption changes, it cannot support scenario analysis. The paper gives a plain example: change the purchase price in the model, and nothing updates. Claude Opus 4.5 showed the same weakness.
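A check for this kind of flaw can be sketched in a few lines. The example below stands in for a spreadsheet with a plain dictionary of cell contents, where formulas are strings starting with "=" as they are in Excel. The cell layout, the figures, and the function name are hypothetical; a real check would read the contents from the workbook file.

```python
def hardcoded_cells(sheet, output_cells):
    """Return the output cells that hold a typed-in value instead of a formula."""
    return [ref for ref in output_cells if not str(sheet[ref]).startswith("=")]


# Hypothetical acquisition model. Inputs may be literals; outputs should not be.
sheet = {
    "B1": 500.0,      # purchase price (input)
    "B2": 0.6,        # share of the price financed with debt (input)
    "B3": "=B1*B2",   # debt raised: a formula, so it follows the purchase price
    "B4": 200.0,      # equity contribution typed in by hand; should be =B1-B3
}

flagged = hardcoded_cells(sheet, ["B3", "B4"])
print(flagged)
```

In this toy model, changing B1 would update the debt figure but leave the equity contribution stuck at 200, which is exactly the behavior the paper describes.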
An analysis of GPT-5.4 agent trajectories found four recurring failure modes. The most common, at 41 percent, involved bugs in code and formula generation. In some cases, agents called python-pptx functions that do not exist, then deleted the broken line rather than fixing the underlying problem.
Other errors showed up in the business logic. In 27 percent of cases, the model applied reasoning incorrectly, such as adding cost synergies to the revenue line instead of to costs. Another 18 percent of errors came from aborted data queries. In 13 percent of cases, agents fabricated missing numbers and presented them as sourced.
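The synergy mistake is easy to miss because the bottom line can come out the same either way. A small worked example, using hypothetical figures and a deliberately simplified income statement (revenue minus costs equals EBITDA), shows what actually goes wrong:

```python
def combined_pnl(revenue, costs, cost_synergies, synergies_in_revenue=False):
    """Simplified pro forma income statement for a merged company."""
    if synergies_in_revenue:
        revenue = revenue + cost_synergies   # the error: savings booked as sales
    else:
        costs = costs - cost_synergies       # correct: savings reduce the cost base
    ebitda = revenue - costs
    return {
        "revenue": revenue,
        "costs": costs,
        "ebitda": ebitda,
        "margin": ebitda / revenue,
    }


# Hypothetical combined figures, in millions.
correct = combined_pnl(1000.0, 800.0, 50.0)
wrong = combined_pnl(1000.0, 800.0, 50.0, synergies_in_revenue=True)

print(correct["revenue"], correct["ebitda"], round(correct["margin"], 3))
print(wrong["revenue"], wrong["ebitda"], round(wrong["margin"], 3))
```

EBITDA is identical in both versions, but the incorrect one overstates revenue and understates the margin, so anything keyed to the revenue line, such as growth rates or revenue multiples, inherits the error.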
The paper also highlights subtle inconsistencies. In one generated deck, the verifier identified a revenue figure of $189.5 billion on one slide and $201.0 billion on another slide for the same period. In another case, an agent used Netflix red as an accent color even though the bank's style guide required uniform blue.
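A cross-slide consistency check of the kind that would catch the revenue mismatch can be sketched as follows. It assumes the figures have already been extracted from the deck as (slide, metric, period, value) records; the extraction step, the slide numbers, the period label, and the EBITDA values are hypothetical, while the two revenue values are the ones cited above, in billions of dollars.

```python
from collections import defaultdict


def conflicting_figures(facts):
    """Return metrics that appear with more than one value for the same period."""
    by_key = defaultdict(lambda: defaultdict(list))
    for slide, metric, period, value in facts:
        by_key[(metric, period)][value].append(slide)
    return {
        key: dict(values)
        for key, values in by_key.items()
        if len(values) > 1
    }


# Each record: (slide number, metric, period, value in $ billions).
facts = [
    (3, "revenue", "FY", 189.5),
    (7, "revenue", "FY", 201.0),
    (7, "ebitda", "FY", 40.0),
    (9, "ebitda", "FY", 40.0),
]

conflicts = conflicting_figures(facts)
print(conflicts)
```

The check reports the revenue conflict along with the slides involved and stays silent on EBITDA, which is stated consistently.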
A more serious example came from a competitive analysis for a pharma deal, where an agent fabricated specific clinical trial data after failing to find it in the SEC database.
What This Means For AI Agents In Finance
Performance varied by deliverable type and by product area. The models generally did better on PowerPoint tasks than on Excel work. The toughest areas were debt capital markets, merger models, and capital structure tables.
The researchers attribute part of the gap to missing domain knowledge. When tasks were enriched with the kind of context bankers normally assume, scores rose significantly. That suggests better tools and better context may improve performance, but it does not change the current conclusion that no model produced client-ready work.
BankerToolBench can also be used for reinforcement learning. In experiments with Qwen-3-4B and Qwen-3-32B, Dr. GRPO and DPO improved benchmark performance by a factor of five to thirteen, though from a very low baseline.
The authors also note limits. The benchmark is US-focused, lacks confidential deal information, and does not capture the iterative teamwork inside a real bank. Even with those limits, it is a detailed test of whether AI agents can handle demanding knowledge work in finance.
For now, the evidence points to a narrow role. AI can help draft, organize, and begin investment banking work. It cannot yet replace the review, judgment, and accountability required before that work reaches a client.