Why AI benchmarks are failing to prove real progress

AI companies often use benchmarks to show that new models are improving. Researchers argue many popular tests are outdated, hard to reproduce, or poorly matched to the risks they are supposed to measure.

AI model launches often arrive with benchmark scores that appear to show clear progress. The numbers can look precise, but new research argues that many of the tests behind those numbers are weak foundations for judging what models can really do.

That weakness matters because benchmarks are no longer just marketing material. They are becoming part of how governments and safety organizations decide which AI systems deserve closer scrutiny.

Why benchmark scores matter now

A benchmark is a test for an AI model. Some benchmarks use a multiple-choice format, such as the Massive Multitask Language Understanding benchmark, known as MMLU. Others check whether a model can perform a specific task, or grade the quality of its answers to a fixed set of questions.

AI companies frequently point to benchmark performance when presenting a new model. OpenAI’s GPT-4o, launched in May 2024, was introduced with results showing it ahead of other companies’ latest models across several tests.

The problem, according to the research, is that many popular benchmarks are poorly designed, difficult to replicate, and built around metrics that can be arbitrary. Anka Reuel, an author of the paper, describes the current environment as lacking strong evaluation standards.

Anna Ivanova, professor of psychology at the Georgia Institute of Technology and head of its Language, Intelligence, and Thought (LIT) lab, says developers tend to optimize for the specific benchmarks used to judge their systems. That means the test itself can shape what companies build toward.

The reproduction problem

Reuel and her colleagues began by trying to verify benchmark results released by model developers. Often, they could not reproduce those results.

Part of the issue is practical. To run a benchmark on a model, evaluators usually need instructions or code. Many benchmark creators do not make that code publicly available. In other cases, the available code is outdated.

There is also a tension around the questions and answers in benchmark datasets. If creators publish them, companies could train models on the test material. That would make the benchmark less like an independent exam and more like a test where the answers were already visible.

But keeping the data hidden creates another problem: outsiders have less ability to evaluate whether the benchmark is strong, fair, or measuring what it claims to measure. The result is a system where important claims can be hard to check.
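
The contamination risk described above can be made concrete with a rough overlap check. The sketch below is illustrative only and is not the method used by the researchers: the function names, the sample strings, and the five-word window are all assumptions chosen for the example. It measures what fraction of a benchmark question's word sequences also appear in a piece of training text.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in the text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question: str, training_text: str, n: int = 5) -> float:
    """Fraction of the question's n-grams that also occur in the training text."""
    question_grams = word_ngrams(question, n)
    if not question_grams:
        # Question shorter than n words: nothing to compare.
        return 0.0
    shared = question_grams & word_ngrams(training_text, n)
    return len(shared) / len(question_grams)


# Hypothetical strings, invented for illustration.
question = "what is the capital city of france"
leaked_page = "quiz answers: what is the capital city of france paris"
unrelated_page = "the weather in the mountains was cold and clear today"

print(overlap_fraction(question, leaked_page))     # high overlap suggests leakage
print(overlap_fraction(question, unrelated_page))  # no shared five-word sequences
```

A check like this only works for someone who can see both the test questions and the training data, which is exactly the access outsiders usually lack.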

When tests stop measuring progress

Another weakness is saturation. A benchmark becomes saturated when the problems on it have largely been solved, making it less useful for distinguishing between newer and stronger models.

A simple example illustrates the problem: one generation of a model scores 20% on a test, the next scores 90%, and the third scores 93%. That pattern could make it look as though progress has slowed. Another explanation is that the test has stopped capturing the difference between the second and third systems.

This is especially important when benchmark results are used to compare frontier models. A saturated test can still produce a ranking, but the ranking may say less about meaningful capability than the numbers suggest.
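
The 20%, 90%, 93% example can be checked with a little arithmetic. The sketch below is one possible way to read those numbers, not an analysis from the research, and the function name is made up for illustration. It compares the raw point gain between generations with the share of previously missed questions that the newer model gets right.

```python
def share_of_remaining_solved(prev_score: float, new_score: float) -> float:
    """Fraction of the questions the older model missed that the newer model now solves."""
    headroom = 1.0 - prev_score
    if headroom <= 0:
        raise ValueError("previous score leaves no headroom on this test")
    return (new_score - prev_score) / headroom


# Scores for the three model generations in the example.
scores = [0.20, 0.90, 0.93]

for prev, new in zip(scores, scores[1:]):
    point_gain = new - prev
    headroom = 1.0 - prev
    print(
        f"{prev:.0%} -> {new:.0%}: gain {point_gain:.0%}, "
        f"headroom was {headroom:.0%}, "
        f"share of remaining questions solved {share_of_remaining_solved(prev, new):.1%}"
    )
```

The second generation had 80 points of headroom to work with. The third had only 10, so even a far more capable model could not have gained more than 10 points on this test.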

There is a safety concern too. Reuel warns that poorly designed benchmarks can create a false sense of safety, especially in high-stakes settings. A model may appear safe under a benchmark while still failing in ways the benchmark does not test.

Regulation is raising the stakes

Benchmarks already appear in government plans for AI oversight. The EU AI Act, whose obligations for general-purpose AI models begin applying in August 2025, references benchmarks as a tool for deciding whether a model demonstrates “systemic risk.” If it does, the model faces higher levels of scrutiny and regulation.

The UK AI Safety Institute also references benchmarks in Inspect, its framework for evaluating the safety of large language models.

That makes benchmark quality more than an academic concern. If a test helps determine how a model is regulated, then the test needs to be reliable, relevant, and clear about what it measures.

Reuel and her Stanford colleagues also launched BetterBench, a website that ranks popular AI benchmarks. Its rating factors include whether experts were consulted during design, whether the tested capability is well defined, whether there is a feedback channel, and whether the benchmark has been peer-reviewed.

The MMLU benchmark received the lowest ratings. Dan Hendrycks, director of CAIS, the Center for AI Safety, and one of the creators of MMLU, disagreed with the rankings. He also said the best way to move the field forward is to build better benchmarks.
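
The four BetterBench rating factors named above can be pictured as a checklist. The sketch below is a simplified illustration, not BetterBench's actual scoring method: the class name, the field names, the equal weighting, and the example values are all assumptions.

```python
from dataclasses import dataclass, fields


@dataclass
class BenchmarkChecklist:
    """Yes/no answers for the four rating factors mentioned in the article."""
    experts_consulted: bool
    capability_well_defined: bool
    feedback_channel: bool
    peer_reviewed: bool


def share_of_criteria_met(checklist: BenchmarkChecklist) -> float:
    """Fraction of checklist items answered yes, with every item weighted equally (an assumption)."""
    answers = [getattr(checklist, field.name) for field in fields(checklist)]
    return sum(answers) / len(answers)


# A made-up benchmark, not a rating of any real one.
hypothetical = BenchmarkChecklist(
    experts_consulted=True,
    capability_well_defined=False,
    feedback_channel=True,
    peer_reviewed=False,
)
print(share_of_criteria_met(hypothetical))
```

A single number like this hides the disagreement the article describes: researchers may accept the checklist items and still differ on how much each one should count.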

What better AI evaluation could look like

Several researchers agree that stronger benchmarks are needed, even if they disagree about the best criteria. Marius Hobbhahn, CEO of Apollo Research, says implementation and documentation criteria are valuable, but the central question is whether the benchmark measures the right thing.

That distinction matters. A benchmark can be well documented and still be poorly matched to the concern it is supposed to address. A test of Shakespeare sonnet analysis, for example, would not answer concerns about AI hacking capabilities.

Amelia Hardy, another author of the paper and an AI researcher at Stanford University, points to moral reasoning as an area where definitions can be unclear. She asks whether experts in the relevant domain are being included in the process, and says that often they are not.

Some organizations are trying to improve benchmark design. Epoch AI created a new benchmark with input from 60 mathematicians and verification by two winners of the Fields Medal. The most advanced models currently answer fewer than 2% of its questions correctly. Tamay Besiroglu, associate director at Epoch AI, says the group tried to represent the full breadth and depth of modern math research. He speculates that models may saturate the benchmark, which he defines as scoring higher than 80%, in around four years.

CAIS is also collaborating with Scale AI on Humanity’s Last Exam, or HLE. Hendrycks says HLE was developed by a global team of academics and subject-matter experts and contains unambiguous, non-searchable questions requiring a PhD-level understanding to solve.

The central point is not that benchmarks are useless. It is that they carry more weight than their current design often supports. If benchmark scores guide company strategy, public confidence, and regulation, then the field needs a much clearer understanding of what a good benchmark is supposed to prove.