Why OpenAI's o3 benchmark gap matters for AI testing

Epoch AI found that OpenAI's public o3 model scored around 10% on FrontierMath, well below the figure of over 25% that OpenAI discussed in December. The difference appears tied to compute settings, testing setup, benchmark version, and the fact that the released o3 was optimized for product use.

OpenAI's o3 is still a strong reasoning model, but its public benchmark story has become more complicated. Independent FrontierMath results from Epoch AI show a lower score than the headline figure OpenAI presented when it introduced the model in December, adding another example of why AI benchmark claims need careful reading.

What Changed Between The Demo And The Release

When OpenAI unveiled o3 in December, the company said the model could answer just over a fourth of the questions on FrontierMath, a difficult set of math problems. That was far above the next-best model, which answered only around 2% of FrontierMath problems correctly.

During a livestream, Mark Chen, chief research officer at OpenAI, described the internal result this way: "Today, all offerings out there have less than 2% [on FrontierMath]," and added, "We're seeing [internally], with o3 in aggressive test-time compute settings, we're able to get over 25%."

The key phrase is "aggressive test-time compute settings." Based on later results, the figure of over 25% appears to have been an upper-end result from a version of o3 running with more compute than the public model OpenAI launched last week.

Epoch AI's FrontierMath Result

Epoch AI, the research institute behind FrontierMath, released its independent tests of o3 on Friday. Its result: o3 scored around 10% on FrontierMath, below OpenAI's highest claimed score.

That gap does not automatically mean OpenAI's original presentation was false. The source article notes that OpenAI's December benchmark materials also included a lower-bound score that matches what Epoch observed. Epoch also said its evaluation setup likely differed from OpenAI's, and that it used an updated release of FrontierMath.

Epoch pointed to several possible reasons for the difference. OpenAI may have evaluated o3 with a more powerful internal scaffold, used more test-time compute, or run its evaluation on a different FrontierMath subset: the 180 problems in frontiermath-2024-11-26 rather than the 290 problems in frontiermath-2025-02-28-private.
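To make the subset point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes, as only one of the possibilities Epoch raised, that the figure of over 25% was measured on the 180-problem set, and it treats the rounded percentages quoted above as exact. The function name implied_solved is made up for illustration, and the counts it prints are approximations, not reported results.

```python
# Rough arithmetic only: the percentages are the rounded figures quoted in
# the article, and the pairing of 25% with the 180-problem set is an
# assumption (one of several explanations Epoch offered).

def implied_solved(score_fraction: float, num_problems: int) -> int:
    """Approximate number of problems solved for a given score and set size."""
    return round(score_fraction * num_problems)

# (label, quoted score as a fraction, problems in that benchmark release)
runs = [
    ("frontiermath-2024-11-26 at ~25%", 0.25, 180),
    ("frontiermath-2025-02-28-private at ~10%", 0.10, 290),
]

for label, score, size in runs:
    solved = implied_solved(score, size)
    print(f"{label}: roughly {solved} of {size} problems")
```

The two percentages are fractions of different problem sets, so they are not directly comparable without knowing how the sets overlap and differ in difficulty.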

For readers trying to compare AI models, those details matter. A benchmark score is not just a property of a model name. It can depend on the model version, the amount of compute used at test time, the surrounding evaluation system, and the exact benchmark set.

The Public o3 May Not Be The Same System

The ARC Prize Foundation, which tested a prerelease version of o3, also indicated that the public release differs from the system it had evaluated. According to a post on X from the organization, the public o3 model "is a different model […] tuned for chat/product use."

ARC Prize also wrote that "All released o3 compute tiers are smaller than the version we [benchmarked]." In general, larger compute tiers can be expected to perform better on benchmarks, so a smaller public configuration may reasonably produce different results.

ARC Prize said it would re-test released o3 on ARC-AGI-1 and would relabel its earlier reported results as "preview." Those earlier preview results were listed as o3-preview (low): 75.7%, $200/task and o3-preview (high): 87.5%, $34.4k/task, using o1 pro pricing.
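The two preview tiers can be put side by side with simple arithmetic. The sketch below uses only the scores and per-task costs quoted above; the helper names cost_ratio and score_gain are illustrative, not anything ARC Prize published.

```python
# Figures quoted in the article for the ARC-AGI-1 preview runs
# (costs use o1 pro pricing, per ARC Prize).
LOW_SCORE_PCT, LOW_COST_USD = 75.7, 200.0
HIGH_SCORE_PCT, HIGH_COST_USD = 87.5, 34_400.0

def cost_ratio(high_cost: float, low_cost: float) -> float:
    """How many times more expensive the high tier is per task."""
    return high_cost / low_cost

def score_gain(high_score: float, low_score: float) -> float:
    """Percentage points gained by moving to the high tier."""
    return high_score - low_score

ratio = cost_ratio(HIGH_COST_USD, LOW_COST_USD)
gain = score_gain(HIGH_SCORE_PCT, LOW_SCORE_PCT)
print(f"High tier costs about {ratio:.0f}x more per task")
print(f"High tier gains about {gain:.1f} percentage points")
```

On these figures, a much larger compute budget buys a comparatively modest score increase, which is why the compute tier belongs next to any reported number.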

OpenAI's own Wenda Zhou, a member of the technical staff, also said during a livestream last week that the o3 in production is "more optimized for real-world use cases" and speed than the version shown in December. He said this may create benchmark "disparities."

Zhou explained that OpenAI had made optimizations to make the model "more cost-efficient" and "more useful in general." He also said users "won't have to wait as long" when asking for an answer, which he described as a real issue with these types of models.

Why This Matters Beyond One Score

The immediate practical impact may be limited. The source article notes that OpenAI's o3-mini-high and o4-mini models outperform o3 on FrontierMath, and that OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

Still, the episode highlights a broader issue in AI model evaluation. Benchmark scores are often used as shorthand for model quality, but the number alone can hide important context.

  • Model identity: A preview, internal, or product-tuned version may not be the same system users can access.
  • Compute level: More test-time compute can change performance, especially for reasoning models.
  • Benchmark version: Different problem sets can make direct comparisons harder.
  • Evaluation setup: Internal scaffolds and testing methods can affect the final result.

That does not make benchmarks useless. It means they need labels that clearly explain what was tested, under what conditions, and whether the tested system is the same one available to customers or developers.
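One way to picture such labels is as a small record attached to every score. The Python sketch below is a hypothetical illustration, not a format used by OpenAI, Epoch AI, or ARC Prize: the BenchmarkResult class, its field names, and the mismatched_fields helper are all invented here, and values marked "not stated" are placeholders for details the reporting does not give.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkResult:
    """A score plus the context needed to interpret it (hypothetical schema)."""
    model: str
    variant: str             # e.g. internal, preview, or public release
    compute_setting: str     # test-time compute level, if disclosed
    benchmark_version: str   # exact problem set used
    scaffold: str            # surrounding evaluation setup
    score_pct: float

# Context fields that must match before two scores are compared directly.
CONTEXT_FIELDS = ("model", "variant", "compute_setting",
                  "benchmark_version", "scaffold")

def mismatched_fields(a: BenchmarkResult, b: BenchmarkResult) -> list[str]:
    """Return the context fields on which two results differ."""
    return [name for name in CONTEXT_FIELDS
            if getattr(a, name) != getattr(b, name)]

# Illustrative entries; scores are the rounded figures from the article.
internal = BenchmarkResult("o3", "internal (December)",
                           "aggressive test-time compute",
                           "not stated", "not stated", 25.0)
public = BenchmarkResult("o3", "public release", "not stated",
                         "frontiermath-2025-02-28-private",
                         "Epoch AI setup", 10.0)

print(mismatched_fields(internal, public))
```

With these placeholder entries, the two results share only the model name, which is the situation the o3 numbers illustrate: the same label on two differently configured evaluations.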

A Larger Pattern In AI Benchmarking

The o3 discussion fits into a wider pattern of AI benchmark disputes. As companies compete for attention around new models, benchmark charts and headline numbers have become central to launch narratives.

The source article points to several recent examples. In January, Epoch was criticized for waiting to disclose funding from OpenAI until after OpenAI announced o3, and many academics who contributed to FrontierMath were not informed of OpenAI's involvement until it became public.

More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for Grok 3. Just this month, Meta admitted to promoting benchmark scores for a version of a model that differed from the one it made available to developers.

The lesson is straightforward: AI benchmark results are most useful when they come with enough context to compare like with like. For o3, the public model's FrontierMath score of around 10% does not erase OpenAI's internal result, but it does show why a model launch number should not be treated as the whole story.