Private judgment helped Qwen3-235B outperform GPT in finance

Bridgewater and Thinking Machines Lab say a fine-tuned open-weight AI model beat leading commercial models on internal finance document tests. The result points to the value of proprietary examples and expert judgment that are not available in public training data.

Bridgewater and Thinking Machines Lab say their finance-focused AI work shows a simple but important point: the most useful answers are not always sitting in public data. In their internal evaluation, a fine-tuned open-weight model performed better than leading commercial models at judging financial documents, while costing far less to operate.

What the test was trying to measure

The project focused on a familiar problem for investors. They face a constant flow of news, analysis, corporate filings, and emails. The report from Bridgewater's AIA Labs and Thinking Machines Lab says the hard part is not only reading that material, but deciding what matters.

The researchers defined six tasks based on an investor's daily routine. One task involved deciding whether a financial article was relevant to an executive. Another asked whether a central bank document indicated the direction of future rate changes.

These are not always easy rules to write down. The source gives an example: a headline about Trump's claim to Greenland was treated as irrelevant, while Trump's threat of new China tariffs was highly relevant. Both involve geopolitics and finance, but the investor judgment is different.

Why frontier models struggled

In the authors' tests, variants of Gemini, Claude, and GPT reached only about 50 percent accuracy with a basic prompt. That improved when the researchers added expert-written instructions and a three-tier rating system: "relevant and interesting," "relevant but uninteresting," and "irrelevant."

Even with that added structure, accuracy moved only into the mid-70s. The authors had set an 80 percent threshold for trustworthy deployment, so the prompted frontier models still did not meet the bar.
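The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the three rating names and the 80 percent bar come from the report, while the `Rating` enum, the `accuracy` and `meets_bar` helpers, and the four-document toy sample are assumptions made up for the example.

```python
from enum import Enum


class Rating(Enum):
    """The three tiers named in the report."""
    RELEVANT_INTERESTING = "relevant and interesting"
    RELEVANT_UNINTERESTING = "relevant but uninteresting"
    IRRELEVANT = "irrelevant"


# The authors' stated bar for trustworthy deployment.
DEPLOYMENT_THRESHOLD = 0.80


def accuracy(predicted, expert):
    """Fraction of documents where the model's rating matches the expert's."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")
    if not expert:
        raise ValueError("need at least one labeled document")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


def meets_bar(acc, threshold=DEPLOYMENT_THRESHOLD):
    """True when accuracy reaches the deployment threshold."""
    return acc >= threshold


# Hypothetical toy sample: the model agrees with the expert on 3 of 4 documents.
expert_labels = [
    Rating.RELEVANT_INTERESTING,
    Rating.RELEVANT_UNINTERESTING,
    Rating.IRRELEVANT,
    Rating.RELEVANT_UNINTERESTING,
]
model_labels = [
    Rating.RELEVANT_INTERESTING,
    Rating.RELEVANT_UNINTERESTING,
    Rating.IRRELEVANT,
    Rating.RELEVANT_INTERESTING,
]

toy_accuracy = accuracy(model_labels, expert_labels)
print(toy_accuracy, meets_bar(toy_accuracy))
```

In this toy sample the model scores 0.75, which falls short of the 0.80 bar, mirroring the situation the report describes for the prompted frontier models.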

The report also says newer models did not deliver much more value per dollar in this setup. GPT 5.4 costs 43 percent more than GPT 5.2 but is only marginally more accurate, according to the report.

The value was in proprietary examples

Bridgewater and Thinking Machines Lab turned to fine-tuning. Instead of relying only on broad commercial models, they retrained an open-weight model using proprietary examples that reflected Bridgewater investors' judgment.

The labeling process mattered. At first, low-cost outside contractors labeled the documents, but many of their labels were wrong. Having highly paid investment professionals review every label would have been too expensive, so the researchers used a narrower correction process.

A first model learned from the flawed labels and then re-evaluated the same documents. When the model and the original label disagreed, that case was more likely to contain an error. Only those disputed cases went to investors for correction.

This approach concentrated expert attention where it was most useful. It also turned internal judgment, which can be difficult to explain in formal rules, into training data for the model.
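The triage step can be sketched as follows. This is a simplified illustration under stated assumptions, not the team's pipeline: `find_disputed` and `merge_corrections` are hypothetical helper names, `predict` stands in for the first model trained on the flawed labels, and the headlines, labels, and keyword rule are invented toy data.

```python
from typing import Callable, Dict, List, Sequence


def find_disputed(
    documents: Sequence[str],
    noisy_labels: Sequence[str],
    predict: Callable[[str], str],
) -> List[int]:
    """Return indices where the model's prediction disagrees with the label.

    These are the cases most likely to contain a labeling error, so they
    are the only ones sent to experts for review.
    """
    if len(documents) != len(noisy_labels):
        raise ValueError("documents and noisy_labels must have the same length")
    return [
        i
        for i, (doc, label) in enumerate(zip(documents, noisy_labels))
        if predict(doc) != label
    ]


def merge_corrections(
    noisy_labels: Sequence[str],
    corrections: Dict[int, str],
) -> List[str]:
    """Overwrite only the reviewed positions with the experts' labels."""
    cleaned = list(noisy_labels)
    for index, expert_label in corrections.items():
        cleaned[index] = expert_label
    return cleaned


# Toy stand-in for the first model (assumption: a trivial keyword rule).
def toy_predict(doc: str) -> str:
    return "relevant" if "tariff" in doc.lower() else "irrelevant"


docs = [
    "New tariffs threatened on imports",
    "Celebrity opens a restaurant",
    "Tariff schedule revised for next quarter",
]
contractor_labels = ["relevant", "relevant", "irrelevant"]

disputed = find_disputed(docs, contractor_labels, toy_predict)

# Hypothetical expert verdicts for the disputed cases only.
expert_review = {1: "irrelevant", 2: "relevant"}
cleaned_labels = merge_corrections(contractor_labels, expert_review)
print(disputed, cleaned_labels)
```

Here only two of the three documents reach a reviewer, and the undisputed label is kept as it was. In practice the cleaned labels would then be used to train the next model.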

What Qwen3-235B achieved

The training ran on Thinking Machines Lab's Tinker platform, with the open-weight model Qwen3-235B as the base. In the team's own evaluation, the fine-tuned model reached 84.7 percent accuracy. The best frontier model tested reached 78.2 percent.

The fine-tuned model also cost nearly 14 times less to run. The source notes an important limitation: this was not a truly independent comparison, and both companies have an interest in selling their product.
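To put the reported accuracy figures side by side, here is a short worked calculation. Only the two accuracy values come from the team's evaluation; the helper names are made up for illustration, and the relative error reduction is a derived figure that does not appear in the report.

```python
# Accuracy figures reported in the team's own evaluation (percent).
FINE_TUNED_ACC = 84.7
BEST_FRONTIER_ACC = 78.2


def percentage_point_gain(new_acc: float, old_acc: float) -> float:
    """Absolute difference in accuracy, in percentage points."""
    return new_acc - old_acc


def relative_error_reduction(new_acc: float, old_acc: float) -> float:
    """Share of the old model's errors that the new model eliminates."""
    old_error = 100.0 - old_acc
    new_error = 100.0 - new_acc
    return (old_error - new_error) / old_error


gain = percentage_point_gain(FINE_TUNED_ACC, BEST_FRONTIER_ACC)
reduction = relative_error_reduction(FINE_TUNED_ACC, BEST_FRONTIER_ACC)
print(f"gain: {gain:.1f} points, error reduction: {reduction:.1%}")
```

The gap is 6.5 percentage points, which corresponds to removing roughly 30 percent of the best frontier model's errors on these tasks.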

Still, the broader result is significant. The work suggests that strong AI performance in specialized business tasks may depend less on generic model size and more on access to the right private examples.

Why this matters for companies

The report points to a larger implication for AI strategy. Large AI labs have not absorbed every useful dataset. Companies still hold proprietary corporate data and human expertise that has not yet been captured in training data, both of which can improve performance in narrow, valuable tasks.

That is especially relevant when the data is sensitive. The source notes that companies may not want to share their most valuable data with a frontier lab, because they risk competing against a product built on top of it.

Fine-tuning open models through tools like Tinker offers another route. Companies can keep the weights, the data, and, depending on the setup, the GPUs themselves.

The lesson is not that frontier models are weak. It is that in specialized domains such as financial document analysis, the missing ingredient may be private judgment. When that judgment is captured carefully, an open-weight model can become more useful than a more general commercial system.