The Decoder October 5, 2024 IDIOCRACY

Why Reflection 70B became a benchmark warning for open-source AI

Reflection 70B was introduced with unusually strong claims, including that it could compete with Claude 3.5 Sonnet and GPT-4o. Independent tests did not support the reported performance, and Matt Shumer later acknowledged that the model "didn't achieve the benchmarks originally reported."




Reflection 70B began as one of the louder open-source AI launches of 2024. It was presented by OthersideAI as a major step forward for language models, with founder Matt Shumer claiming that the system could challenge leading closed models and outperform well-known alternatives on several benchmarks.

The story changed quickly. Independent testing raised doubts, public and private model access appeared to produce different results, and Shumer later acknowledged that Reflection 70B had not met the benchmark claims first attached to it. The case has become a useful reminder that AI performance claims need verification before they become accepted facts.

What OthersideAI said Reflection 70B could do

OthersideAI unveiled Reflection 70B as a new language model built around a method called "reflection tuning." The model was described as based on Llama 3, and Shumer said it was the most capable open-source model available at the time.

The claims went further than a routine model release. Shumer said Reflection 70B could compete with Claude 3.5 Sonnet and GPT-4o. The source article also notes a later framing in which the model was said to compete with top closed systems like Claude 3.5 Sonnet and GPT-4.

Benchmarks were central to the launch. Shumer claimed that Reflection 70B outperformed GPT-4o on MMLU, MATH, IFEval, and GSM8K. He also said it appeared to significantly outperform Llama 3.1 405B.

The release was not meant to be the end of the effort. OthersideAI said it planned to release a larger model, Reflection 405B, the following week, alongside a detailed report on the process and results. That larger and more capable model was to be based on Llama 3.1 405B.

How "reflection tuning" was supposed to help

The most important technical idea behind the launch was "reflection tuning." In the description given by Shumer, the technique teaches a model to handle an answer in two stages rather than moving directly from prompt to final output.

First, the model creates an initial response. Then it checks that response for possible mistakes or inconsistencies before producing a revised answer. The goal is to make the model better at identifying its own errors before users see the final result.
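The draft, check, and revise flow can be illustrated with a short Python sketch. This is not OthersideAI's method: reflection tuning was described as a training technique, while the sketch below is an inference-time prompting loop that only mimics the same stages. The function names, the prompt wording, and the `toy_generate` stand-in are all hypothetical.

```python
from typing import Callable, Dict


def reflect_and_revise(prompt: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run a draft -> critique -> revision loop around any text generator.

    `generate` is an assumed interface: it takes a prompt string and
    returns the model's text. The prompts are illustrative only.
    """
    # Stage 1: produce an initial response.
    draft = generate(f"Question: {prompt}\nAnswer:")

    # Stage 2: ask the model to check its own draft for errors.
    critique = generate(
        f"Question: {prompt}\nDraft answer: {draft}\n"
        "List any mistakes or inconsistencies in the draft:"
    )

    # Stage 3: produce a revised answer that takes the critique into account.
    final = generate(
        f"Question: {prompt}\nDraft answer: {draft}\nCritique: {critique}\n"
        "Write a corrected final answer:"
    )
    return {"draft": draft, "critique": critique, "final": final}


def toy_generate(text: str) -> str:
    """Hypothetical stand-in for a real model call, used only for the demo."""
    if "corrected final answer" in text:
        return "4"
    if "mistakes or inconsistencies" in text:
        return "2 + 2 is 4, not 5."
    return "5"


if __name__ == "__main__":
    result = reflect_and_revise("What is 2 + 2?", toy_generate)
    print(result["draft"], "->", result["final"])
```

Swapping `toy_generate` for a real model call would leave the control flow unchanged. Whether the revision actually improves on the draft depends entirely on the model, which is the part the launch claims left unverified.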

This matters because existing language models often "hallucinate" facts without recognizing the problem. Reflection tuning was presented as a way to help Reflection 70B self-correct, while also separating planning from answer generation.

According to the source, the approach was also intended to improve chain-of-thought prompting while keeping outputs simple and accurate for end users. Glaive AI provided synthetic training data for Reflection, and Shumer credited that role directly in the original launch discussion.

OthersideAI also said it used LMSYS's LLM Decontaminator to check Reflection 70B for overlap with test datasets. That detail was meant to address a common concern in AI benchmarking: if a model has seen material too close to a benchmark, the reported score may not reflect genuine general capability.
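To make the overlap concern concrete, here is a minimal Python sketch of a contamination check. It is not how LLM Decontaminator works. It swaps in a much simpler technique, exact word n-gram overlap, which would miss paraphrased questions. The function names, the n-gram size, and the threshold are assumptions chosen for illustration.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(train_sample: str, test_item: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the training sample."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0  # test item shorter than n words: nothing to compare
    return len(test_grams & ngrams(train_sample, n)) / len(test_grams)


def flag_contaminated(
    train_samples: Sequence[str],
    test_items: Sequence[str],
    n: int = 8,
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of test items that overlap heavily with any training sample."""
    flagged = []
    for index, item in enumerate(test_items):
        if any(overlap_ratio(sample, item, n) >= threshold for sample in train_samples):
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    tests = [
        "the quick brown fox jumps over the lazy dog near the river",
        "completely unrelated question about prime numbers and modular arithmetic in number theory",
    ]
    print(flag_contaminated(train, tests, n=3))
```

A check like this only catches near-verbatim copies. A clean result from it would not rule out training on rewritten versions of benchmark questions.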

Independent tests complicated the launch

The first major problem was that outside results did not match the strongest launch claims. According to the comparison platform Artificial Analysis, Reflection 70B scored lower on benchmarks than Llama 3.1 70B, the model it was supposedly based on.

Shumer responded by saying there had been problems uploading model weights to Hugging Face. He said the uploaded weights were a mix of several different models, while the internally hosted model performed better.

That explanation led to another round of testing. Shumer gave select individuals exclusive access to a model, and Artificial Analysis repeated its test. Those results were better than the public API results, but Artificial Analysis could not confirm which model it had accessed.

New Reflection model weights were later uploaded to Hugging Face. The source says those weights performed significantly worse in tests than the model previously available through the private API.
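Confusion over which weights were which is the kind of problem file fingerprints can narrow down. The Python sketch below hashes every file in two local copies of a model directory and lists the files that differ. It is an illustration, not something the source says any party did, and the function names are hypothetical. It also cannot reveal what sits behind a hosted API. It only shows whether two sets of published files are byte-identical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def sha256_of_file(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_dir(directory: PathLike) -> Dict[str, str]:
    """Map each file's path (relative to `directory`) to its SHA-256 digest."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_fingerprints(first: Dict[str, str], second: Dict[str, str]) -> List[str]:
    """Return the sorted names of files that are missing or differ between two uploads."""
    return sorted(
        name for name in first.keys() | second.keys()
        if first.get(name) != second.get(name)
    )
```

If evaluators record fingerprints alongside benchmark results, anyone can later check that a reported score was produced with the same files that are publicly available.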

The controversy also included evidence found by users that the Reflection API was sometimes calling Anthropic Claude 3.5 Sonnet. For an open-source model launch, that kind of uncertainty matters because users and evaluators need to know which system is actually producing an answer.

Why benchmarks were part of the problem

The Reflection 70B episode also exposed a broader concern: benchmark numbers can be fragile signals. Nvidia AI researcher Jim Fan explained, presumably in the context of Reflection 70B, that benchmarks such as MMLU, GSM8K, and HumanEval can be manipulated easily.

Fan's point was not just that benchmark results can be wrong. The source says models can be trained with paraphrased or newly generated questions that resemble test questions. Timing and additional computing power during inference can also improve scores.

Fan therefore considered those benchmarks unreliable. Instead, he recommended LMSYS's Chatbot Arena, where humans score LLM results in a blind test, or private benchmarks from third-party providers such as Scale AI. In his view, those approaches are a better path to identifying stronger models.
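Blind pairwise voting can be turned into a ranking in several ways. The Python sketch below uses a plain Elo update as one example. The source does not describe the Arena's actual rating method, so the formula, the starting rating of 1000, the K-factor of 32, and the model names are all assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Shift ratings in place after one blind vote in which `winner` beat `loser`."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)  # larger shift when the result was unexpected
    ratings[winner] += delta
    ratings[loser] -= delta


def rate(
    votes: Iterable[Tuple[str, str]],
    initial: float = 1000.0,
    k: float = 32.0,
) -> Dict[str, float]:
    """Compute ratings from (winner, loser) pairs, processed in order."""
    ratings: Dict[str, float] = {}
    for winner, loser in votes:
        ratings.setdefault(winner, initial)
        ratings.setdefault(loser, initial)
        update_elo(ratings, winner, loser, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes: each pair is (preferred model, other model).
    votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
    for name, rating in sorted(rate(votes).items(), key=lambda item: -item[1]):
        print(f"{name}: {rating:.1f}")
```

Because voters do not know which model wrote which answer, and the questions are not a fixed public test set, a ranking built this way is harder to game by training on benchmark-like questions.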

That argument fits the Reflection 70B timeline. The launch centered on benchmark claims, but the public weights, private access, and subsequent tests did not produce a clear, stable performance picture. The result was not simply a weaker model claim. It was a trust problem around how the claim had been made and checked.

What Shumer acknowledged afterward

By October 5, 2024, Shumer had acknowledged the shortfall. In a statement on X, he said the model "didn't achieve the benchmarks originally reported."

He also apologized for the incident and said he would be more careful in the future. At the same time, he did not abandon the underlying idea. Shumer said he planned to keep working on the "reflection tuning" concept because he believed it could move the technology forward.

That leaves Reflection 70B in an unusual position. The launch failed to support its strongest performance claims, but the underlying concept remains under development. The practical lesson is clear: open-source AI needs strong models, but it also needs transparent weights, reproducible testing, and independent verification before major claims can be treated as reliable.