Why crowdsourced AI benchmarks are under new scrutiny

AI labs including OpenAI, Google, and Meta have used crowdsourced benchmarking platforms to test upcoming models and promote strong results. Experts cited by TechCrunch say these rankings can be useful, but they also warn that preference votes, unpaid evaluation work, and lab-controlled submissions can make the results easy to overstate.

Crowdsourced AI benchmarks have become a visible part of how new models are tested, ranked, and marketed. Platforms such as Chatbot Arena ask people to compare outputs from anonymous models, then turn those judgments into public signals that labs can cite when they claim progress.

That process is now facing sharper criticism. Experts quoted by TechCrunch say the method can bring in useful outside perspectives, but they also argue that it can be weak as a benchmark, unclear as evidence, and ethically complicated when volunteer labor helps create value for large AI labs.

Why labs like public model rankings

AI labs are increasingly using crowdsourced benchmarking platforms to examine the strengths and weaknesses of their latest systems. OpenAI, Google, and Meta are among the labs that have turned to services that recruit users to help evaluate upcoming models.

The attraction is straightforward. A public leaderboard gives a simple signal: one model appears to perform better than another. When the result is favorable, the lab behind the model can point to that score as proof of a meaningful improvement.

Chatbot Arena is one of the main examples in the debate. Its process asks volunteers to prompt two anonymous models and choose the response they prefer. The model identities are hidden during the comparison, and the aggregate judgments contribute to how models are ranked.
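The source does not say how those judgments are combined into a ranking. As a rough illustration only, the sketch below assumes an Elo-style update, one common way to turn pairwise preference votes into scores. The function name `elo_rankings`, the model names, the vote list, and the `k` and `base` parameters are all hypothetical and are not taken from Chatbot Arena.

```python
from collections import defaultdict


def elo_rankings(votes, k=32.0, base=1000.0):
    """Turn (winner, loser) preference votes into Elo-style ratings.

    Hypothetical helper for illustration; not the platform's actual method.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability, under current ratings, that the winner would be preferred.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # A surprising result moves both ratings more than an expected one.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest-rated model first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Made-up votes: each pair records which anonymous model a volunteer preferred.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]
leaderboard = elo_rankings(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

In this sketch the final score depends only on which response was picked, and on the order in which votes arrive. Nothing in it records why a volunteer preferred one answer, which is the gap the critics quoted in the following sections focus on.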

For an industry moving quickly, that kind of open testing has obvious appeal. It can expose models to many prompts and many users, and it can make evaluation feel less closed than internal testing. But the same openness creates questions about what, exactly, the results prove.

The academic concern: what is being measured?

Emily Bender, a University of Washington linguistics professor and co-author of the book "The AI Con," argues that a benchmark needs to be clear about what it measures. In her view, Chatbot Arena has not shown that choosing one answer over another reliably maps to a defined idea of preference.

That matters because a benchmark is only useful if its score connects to a well-defined target. A vote can show that a person selected one output in one comparison. It does not automatically explain whether the model is more accurate, safer, more useful for work, better for a specific field, or simply more appealing in that moment.

This is the difference between a popularity signal and a rigorous evaluation. A public preference ranking may reveal something important about how a community reacts to model outputs. But experts in the article warn that treating the same ranking as broad proof of model quality can stretch the result beyond what the process supports.

Asmelash Teka Hadgu, the co-founder of AI firm Lesan and a fellow at the Distributed AI Research Institute, said benchmarks like Chatbot Arena are being "co-opted" by AI labs to "promote exaggerated claims." His concern is not just that the benchmarks are imperfect, but that the scores can become marketing evidence without enough context.

The Meta Maverick dispute shows the risk

The article points to a recent controversy involving Meta’s Llama 4 Maverick model. Meta fine-tuned a version of Maverick to score well on Chatbot Arena, but released a worse-performing version instead.

That discrepancy highlights a central problem for public AI benchmarks: users, researchers, and customers may assume a leaderboard score represents the model they can actually access. If a tested version and a released version differ, the public ranking can become harder to interpret.

Wei-Lin Chiang, an AI doctoral student at UC Berkeley and one of the founders of LMArena, which maintains Chatbot Arena, said the Maverick issue was not a flaw in Chatbot Arena’s design. Chiang said it resulted from labs misinterpreting Chatbot Arena’s policy.

LMArena has responded by updating its policies to "reinforce our commitment to fair, reproducible evaluations," according to Chiang. The organization’s stated aim is to provide an open space that measures community preferences about different AI models.

That response shows the tension around these platforms. Benchmark creators want openness, community feedback, and useful comparison. Critics want stronger safeguards so that rankings cannot be used in ways that imply more than the test can support.

Volunteer work raises ethical questions

The debate is not only technical. Hadgu and Kristine Gloria, who formerly led the Aspen Institute’s Emergent and Intelligent Technologies Initiative, argue that people who evaluate models should be paid for their work.

Gloria said AI labs should learn from the mistakes of the data labeling industry, which the source describes as notorious for exploitative practices. Some labs have also been accused of similar practices. Her point is that crowdsourced benchmarking can resemble citizen science, but it should not become a way to extract valuable evaluation work without fair treatment.

Gloria also said crowdsourced benchmarking can be valuable because it brings in additional perspectives for evaluation and fine-tuning. The problem, in her view, is relying on benchmarks as the only measure. Because the industry is moving so quickly, she said, benchmarks can rapidly become unreliable.

Matt Fredrikson, the CEO of Gray Swan AI, offered a related but more operational view. Gray Swan AI runs crowdsourced red teaming campaigns for models, and Fredrikson said volunteers come to the platform for reasons that include "learning and practicing new skills." Gray Swan also awards cash prizes for some tests.

Even so, Fredrikson acknowledged that public benchmarks "aren’t a substitute" for "paid private" evaluations. He said developers also need internal benchmarks, algorithmic red teams, and contracted red teamers who can take a broader approach or bring domain expertise.

What better AI evaluation could look like

The experts in the source do not dismiss open testing entirely. Instead, they argue for a more layered approach. Crowdsourced rankings can be one input, but not the whole case for whether a model is better, safer, or ready for a particular use.

Hadgu suggested that benchmarks should be dynamic rather than static datasets. He also said they should be distributed across multiple independent entities, including organizations or universities, and tailored to distinct use cases such as education, healthcare, and other fields where practicing professionals use the models for work.

Alex Atallah, the CEO of model marketplace OpenRouter, also said open testing and benchmarking alone "isn’t sufficient." OpenRouter recently partnered with OpenAI to grant users early access to OpenAI’s GPT-4.1 models.

Chiang made a similar point from inside the Chatbot Arena ecosystem, saying LMArena supports the use of other tests. The platform’s goal, according to Chiang, is to measure its community’s preferences about AI models in a trustworthy, open space.

The practical takeaway is narrower than the public hype around leaderboards often suggests. Crowdsourced AI benchmarks can help reveal how people compare model responses. But on their own, they cannot settle every question about quality, reliability, domain performance, or real-world readiness.

For model developers and benchmark creators, Fredrikson said clear communication is essential, especially when results are challenged. That may be the core issue: public rankings are useful only when readers understand what they measure, what they leave out, and whether the tested model is the one people will actually use.