A new study has put one of AI’s most-watched benchmarks under pressure. The paper, from Cohere, Stanford, MIT, and Ai2, alleges that LM Arena gave some leading AI companies a way to improve their standing on Chatbot Arena that other firms were not offered.
The dispute matters because Chatbot Arena has become a prominent way for companies to signal model quality. If access to testing, sampling, or score publication is uneven, the leaderboard becomes more than a neutral scoreboard. It becomes part of the competition itself.
What The Study Claims
The paper says LM Arena allowed companies including Meta, OpenAI, Google, and Amazon to privately test several model variants on Chatbot Arena. According to the authors, the lowest-performing results did not have to be published, which could make it easier for a company to present only a stronger model to the public leaderboard.
Sara Hooker, Cohere’s VP of AI research and co-author of the study, told TechCrunch that only some companies were aware this private testing was available. She also said the amount of private testing was much higher for some companies than for others.
"This is gamification," Hooker said.
The most specific example involves Meta. The authors allege that Meta privately tested 27 model variants on Chatbot Arena between January and March before the Llama 4 release. At launch, Meta publicly revealed the score of a single model, and that model ranked near the top of the Chatbot Arena leaderboard.
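The concern behind that allegation can be illustrated with a small simulation. The sketch below is not the study's analysis. It assumes every variant has the same underlying quality and that each measured score carries random noise; the rating of 1200 and the noise level of 15 points are hypothetical values chosen for illustration. Under those assumptions, publishing only the best of 27 measurements tends to report a higher number than publishing a single measurement, even though no variant is actually better.

```python
import random
import statistics


def published_score(n_variants, true_skill=1200.0, noise_sd=15.0, rng=None):
    """Return the best observed score among n_variants noisy measurements.

    true_skill and noise_sd are hypothetical illustration values,
    not figures from the study or from LM Arena.
    """
    rng = rng or random.Random()
    observed = [rng.gauss(true_skill, noise_sd) for _ in range(n_variants)]
    return max(observed)


def average_published_score(n_variants, trials=2000, seed=0):
    """Average the published (best-of-n) score over many simulated launches."""
    rng = random.Random(seed)
    return statistics.mean(
        published_score(n_variants, rng=rng) for _ in range(trials)
    )


if __name__ == "__main__":
    single = average_published_score(1)
    best_of_27 = average_published_score(27)
    print(f"one variant tested:  {single:.1f}")
    print(f"best of 27 variants: {best_of_27:.1f}")
```

The gap between the two printed numbers comes entirely from selection, which is why the authors treat undisclosed private testing as a fairness issue.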
The paper’s authors began their research in November 2024 after hearing that some AI companies might have been receiving preferential access. They analyzed more than 2.8 million Chatbot Arena battles over a five-month stretch.
Why Chatbot Arena Carries Weight
Chatbot Arena was created in 2023 as an academic research project out of UC Berkeley. It ranks AI models through a crowdsourced process: users see responses from two different models side by side and choose the better answer. Those matchups, described as model battles, help determine each model’s score and leaderboard placement over time.
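How pairwise votes turn into a ranking can be sketched with an Elo-style update. This is an illustration only: the article does not describe LM Arena's actual scoring formula, and the starting rating of 1000, the step size `k`, and the model names are all assumptions made for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings, winner, loser, k=32.0, initial=1000.0):
    """Apply one battle result to the ratings dict (mutated in place)."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    # The winner gains more when the win was unexpected.
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta
    return ratings


def leaderboard(battles):
    """Build a ranking from (winner, loser) pairs, highest rating first."""
    ratings = {}
    for winner, loser in battles:
        update(ratings, winner, loser)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical votes: each tuple is (preferred model, other model).
    votes = [
        ("model-a", "model-b"),
        ("model-a", "model-c"),
        ("model-b", "model-c"),
    ]
    for name, rating in leaderboard(votes):
        print(f"{name}: {rating:.1f}")
```

Because every vote moves two ratings, a model that appears in more battles accumulates more signal, which is part of why access and sampling rules matter.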
The format has made Chatbot Arena useful to AI companies, researchers, and observers because it reflects human preference rather than only a fixed technical test. It is also common for unreleased models to appear under pseudonyms, meaning public users may be judging systems before the company behind them has officially introduced them.
That structure makes the rules around access especially important. If a model can appear in more battles, receive more feedback, or run multiple private versions before a public release, the benchmark can influence how companies tune and present their systems.
The Sampling Rate Dispute
The study also alleges that LM Arena allowed some companies, including Meta, OpenAI, and Google, to collect more data by having their models appear in a larger number of battles. The authors argue that this increased sampling rate created an unfair advantage.
The researchers said additional data from LM Arena could improve a model’s performance on Arena Hard, another LM Arena benchmark, by 112%. LM Arena responded in a post on X that Arena Hard performance does not directly correlate with Chatbot Arena performance.
The study has a clear limitation. It relied on self-identification to determine which models were privately testing on Chatbot Arena. The authors prompted AI models about their company of origin and used the answers to classify them. That method is not foolproof.
Even with that limitation, Hooker said LM Arena did not dispute the preliminary findings when the researchers shared them. LM Arena, however, has publicly rejected several claims in the paper and pointed to its own blog post, which says models from non-major labs appear in more Chatbot Arena battles than the study suggests.
LM Arena’s Response And What May Change
LM Arena co-founder and UC Berkeley professor Ion Stoica told TechCrunch by email that the study contained "inaccuracies" and "questionable analysis." LM Arena also said it is committed to fair, community-driven evaluations and invites all model providers to submit more models for testing and to improve their performance on human preference.
In a statement to TechCrunch, LM Arena argued that one provider submitting more tests than another does not automatically mean the second provider was treated unfairly.
The paper’s authors want LM Arena to make several changes. They say the organization could set a clear and transparent limit on private tests and publicly disclose scores from those tests. LM Arena rejected that suggestion in a post on X, saying it has published information on pre-release testing since March 2024 and that it makes no sense to show scores for pre-release models that the AI community cannot test for itself.
The researchers also suggested changing the sampling rate so all models appear in the same number of battles. LM Arena has been more receptive to that idea and indicated that it will create a new sampling algorithm.
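One way to picture the researchers' suggestion is a sampler that always pairs the two models with the fewest battles so far. The sketch below is hypothetical. LM Arena has not published the new algorithm, and the function names and model labels here are invented for illustration.

```python
import random


def pick_pair(counts, rng):
    """Pick the two models with the fewest battles, breaking ties randomly."""
    models = list(counts)
    rng.shuffle(models)                    # random order among tied models
    models.sort(key=lambda m: counts[m])   # stable sort keeps that tie order
    return models[0], models[1]


def simulate(models, n_battles, seed=0):
    """Run n_battles matchups and return how often each model appeared."""
    rng = random.Random(seed)
    counts = {m: 0 for m in models}
    for _ in range(n_battles):
        a, b = pick_pair(counts, rng)
        counts[a] += 1
        counts[b] += 1
    return counts


if __name__ == "__main__":
    appearances = simulate(["m1", "m2", "m3", "m4", "m5"], n_battles=1000)
    print(appearances)
```

With this rule, no model's battle count ever differs from another's by more than one, so no provider collects materially more data than the rest.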
The debate follows earlier scrutiny around Meta’s Llama 4 models. TechCrunch reported that Meta optimized one Llama 4 model for “conversationality,” which helped it perform well on Chatbot Arena, but the company did not release that optimized version. The vanilla version later performed much worse on Chatbot Arena. At the time, LM Arena said Meta should have been more transparent in how it approached benchmarking.
The timing also raises the stakes for LM Arena. Earlier this month, the organization announced it was launching a company with plans to raise capital from investors. The study therefore adds pressure at a moment when the benchmark’s role, governance, and relationship with major AI labs are already drawing attention.