Benchmarks are supposed to make AI progress measurable. A new international study suggests that many of the tests used to compare large language models are not strong enough to carry that weight.
After reviewing 445 benchmark papers from ICML, ICLR, NeurIPS, ACL, NAACL and EMNLP, covering work from 2018 to 2024, researchers found widespread problems in how LLM benchmarks are designed, sampled and interpreted. The review involved 29 expert reviewers and concluded that nearly every benchmark had at least one major weakness.
Why benchmark validity matters
A benchmark is only useful if it measures the thing it claims to measure. For large language models, that means a high score should reflect the specific capability under test, not a side effect of wording, data selection or task format.
The study frames this as a validity problem. If a benchmark says it tests reasoning, alignment or security, those terms need clear boundaries. Without them, a result can look precise while still being hard to interpret.
The authors write, "Almost all articles have weaknesses in at least one area." That finding does not mean benchmarks are useless. It means their scores often need more caution than the AI field gives them.
Vague definitions make scores harder to trust
The review found that 78 percent of benchmarks define what they are trying to measure. But almost half of those definitions are described as vague or controversial. Terms such as reasoning, alignment and security are often treated as obvious when they are not.
This creates a basic interpretation problem. If a test does not clearly define the target skill, a model score cannot cleanly show whether the model has that skill. It may instead reflect a mixture of abilities that were never separated.
The problem is sharpest for composite skills. About 61 percent of benchmarks test combined abilities such as agentic behavior, which the source article describes as involving both recognizing intent and producing structured output. When those sub-skills are not evaluated separately, it becomes difficult to know what a model actually did well.
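One way to keep sub-skills apart is to score them separately. The sketch below is only an illustration, not the study's method. It assumes a hypothetical agentic task in which the model must return a JSON object with an `intent` field; the function name, the field name and the JSON format are all assumptions made for the example.

```python
import json


def score_agentic_response(raw_output, expected_intent):
    """Score structured-output validity and intent recognition as separate sub-skills.

    Hypothetical task format: the model should return a JSON object
    containing an "intent" field.
    """
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        # The output is not valid JSON, so intent cannot be judged at all.
        return {"valid_structure": False, "intent_correct": None}

    valid = isinstance(parsed, dict) and "intent" in parsed
    # Only judge intent when the structure is usable.
    intent_correct = parsed["intent"] == expected_intent if valid else None
    return {"valid_structure": valid, "intent_correct": intent_correct}
```

Reported this way, a model that understands the request but breaks the output format is distinguishable from one that formats perfectly and misreads the request. A single pass/fail score would treat both as the same failure.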
Task realism is another concern. The study found that 41 percent of benchmarks use artificial tasks, and 29 percent rely on them exclusively. Only about 10 percent use real-world tasks that reflect practical model use. That gap matters because a model can perform well in a narrow test setting without proving that it will behave the same way outside the benchmark.
Sampling and recycled data weaken comparisons
The study also points to problems in how benchmark datasets are chosen. About 39 percent of benchmarks rely on convenience sampling, and 12 percent use it exclusively. In plain terms, that means some benchmarks use data because it is easy to obtain, not because it represents the situations the benchmark claims to cover.
Data reuse is another widespread issue. Around 38 percent of benchmarks draw from human tests or existing sources, and many also depend heavily on datasets from other benchmarks. Reuse is not automatically wrong, but the study argues that authors should disclose it and explain the limits it creates.
The source gives a math example. If a benchmark takes questions from a calculator-free exam, the numbers may have been selected because they allow simple arithmetic. Doing well on that material does not show that a model can handle more difficult calculations.
The larger concern is that benchmark data can become too familiar or too narrow. If test items overlap with material in a model’s training data, results may look stronger than they should. The researchers say contamination testing should be used to check whether benchmark items appear in training data, and hidden test sets should be kept secure.
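A contamination check can take many forms, and the study as summarized here does not prescribe one. As a rough sketch, one common heuristic is to flag benchmark items that share a long run of words with any training document. The function names, the 8-word window and the tokenization below are assumptions chosen for illustration.

```python
import re


def ngrams(text, n=8):
    """Return the set of word n-grams in text, lowercased and without punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Return (fraction flagged, flagged items).

    An item is flagged when it shares at least one n-gram with any
    training document. Items shorter than n words are never flagged.
    """
    if not benchmark_items:
        return 0.0, []

    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)

    flagged = [item for item in benchmark_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items), flagged
```

Exact n-gram overlap misses paraphrased or translated copies, so a low rate from a check like this is weak evidence of a clean test set, not proof of one.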
Statistics are often too thin
Even when the task and data are well chosen, the scoring method can still leave major uncertainty. The review found that over 80 percent of benchmarks use exact match scores. Only 16 percent apply statistical tests when comparing models.
Exact match scoring can be simple and useful, but the study argues that meaningful comparisons need stronger statistical methods and clear uncertainty estimates. Without them, small differences between models may be treated as more meaningful than the evidence supports.
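To make the idea of an uncertainty estimate concrete, here is a minimal sketch of a paired bootstrap, one standard way to put an interval around the accuracy gap between two models scored on the same items. The study does not endorse this specific method; the function name, the 95 percent level and the resample count are assumptions for the example.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Estimate a 95% percentile interval for accuracy(A) - accuracy(B).

    correct_a and correct_b are lists of 0/1 exact-match outcomes,
    aligned so that position i refers to the same benchmark item.
    Returns (observed difference, lower bound, upper bound).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")

    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each A/B pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)

    diffs.sort()
    lower = diffs[int(0.025 * (n_resamples - 1))]
    upper = diffs[int(0.975 * (n_resamples - 1))]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, lower, upper
```

If the resulting interval includes zero, the benchmark has not clearly separated the two models, however the headline scores are ranked.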
Alternative evaluation methods are still uncommon. Just 17 percent of benchmarks use LLMs as judges, and only 13 percent rely on human judgment. The source article notes that most benchmarks skip uncertainty estimates and statistical tests entirely, leaving major gaps in reliability.
The researchers also call for both quantitative and qualitative error analysis. Counting errors is useful, but studying the kinds of errors can reveal recurring weaknesses that a single score might hide.
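The quantitative half of that analysis can be as simple as tallying failures by category once reviewers have labeled them. The sketch below assumes a hypothetical record format, with a `correct` flag and a reviewer-assigned `error_type` on each failed item; neither the format nor the category names come from the study.

```python
from collections import Counter


def error_breakdown(results):
    """Return (accuracy, error categories sorted by frequency).

    Each result is a dict with a boolean "correct" and, for failures,
    an "error_type" label assigned by a reviewer (hypothetical format).
    """
    total = len(results)
    failures = [r for r in results if not r["correct"]]
    accuracy = (total - len(failures)) / total if total else 0.0
    by_type = Counter(r.get("error_type", "unlabeled") for r in failures)
    return accuracy, by_type.most_common()
```

Two models with the same accuracy can produce very different breakdowns, which is the kind of recurring weakness a single score hides. Reading the individual failures in each category remains the qualitative part of the job.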
What stronger LLM benchmarks would require
The study’s recommendations are direct. A benchmark should clearly state what it measures and where its boundaries are. It should focus on the target skill without mixing in unrelated tasks or formats that make results harder to interpret.
Dataset selection should be deliberate. If benchmark authors reuse data, they should say so and explain the consequences. They should also test for contamination and protect hidden test sets so model comparisons remain fair.
The researchers point to GSM8K as an example. They say GSM8K is meant to test math reasoning with grade-school arithmetic, but it also mixes in reading comprehension and logic skills without evaluating those skills separately. That makes a model’s score less straightforward than it may appear.
The Llama 4 controversy is presented as another warning. Meta’s new models initially did well on user benchmarks, although they failed badly on long-context tasks. Meta later admitted to using a specially tuned chat version for the LMArena benchmark, optimized for human judges. The example shows how benchmark results can be shaped by the way a model is prepared for a test.
Benchmarks remain central to AI research because they give researchers a shared way to compare systems and track change. But the study’s message is that weak benchmarks can blur the line between real progress and test-specific performance. For LLM benchmarks to support credible AI progress metrics, they need clearer definitions, better data practices, stronger statistics and more transparency.