The Decoder, April 27, 2025

Why OpenAI's o3 benchmark drop matters for AI reasoning

A recent ARC Prize Foundation analysis found that OpenAI's released o3 model performs well below the earlier o3-preview on ARC-AGI-1. The results also show that more reasoning effort can raise cost without reliably improving accuracy, while ARC-AGI-2 remains largely unsolved.

The story mainly concerns benchmark reliability and current reasoning limits, with only a mild lean toward overdependence on flawed AI evaluation claims.

OpenAI's o3 is still one of the strongest publicly tested models in the ARC Prize Foundation's evaluations, but the picture is more complicated than early expectations suggested. A recent analysis found a large gap between the released o3 model and the earlier o3-preview version tested in December 2024.

The findings matter because ARC-AGI is designed to probe reasoning skills that are easy for many humans but still difficult for AI systems. They also show why benchmark results for unreleased models need careful interpretation.

What ARC-AGI Is Testing

The ARC Prize Foundation is a nonprofit group focused on AI evaluation. It uses open benchmarks such as ARC-AGI to examine the distance between human reasoning and current artificial intelligence systems.

ARC-AGI is not just a general knowledge test. It is built around symbolic reasoning, multistep composition, and rules that depend on context. These are areas where humans often perform well without special training, while AI models still show clear limits.

In the recent analysis, the Foundation tested two models: o3 and o4-mini. Both were evaluated at three reasoning levels: "low," "medium," and "high." Those settings change how deeply the model attempts to reason through a task.

The "low" setting emphasizes speed and low token usage. The "high" setting is meant to support more complete problem-solving. Across ARC-AGI-1 and ARC-AGI-2, the study covered 740 tasks and produced 4,400 data points.

o3 Leads, But the Released Model Falls Far Below o3-preview

On ARC-AGI-1, o3 reached 41 percent accuracy at low compute and 53 percent at medium compute. The smaller o4-mini model scored 21 percent at low compute and 42 percent at medium compute.

Those results place o3-medium at the top of the publicly available models the ARC Prize Foundation has tested on ARC-AGI-1. According to the analysis, that is double the score of earlier chain-of-thought approaches.

But the comparison with o3-preview is striking. In December 2024, o3-preview scored 76 percent at low compute and 88 percent at high compute on ARC-AGI-1 in text mode. The released o3 model now reaches only 41 percent at low compute and 53 percent at medium compute.
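The size of that gap can be worked out from the figures above. The following is a minimal sketch, not part of the ARC Prize analysis: the function name `score_gap` is made up for illustration, and only the low-compute scores are compared, because the released model's reported figures here stop at medium compute while the preview's second figure is for high compute.

```python
# Scores reported in the article (percent accuracy on ARC-AGI-1, low compute).
O3_PREVIEW_LOW = 76
O3_RELEASED_LOW = 41


def score_gap(preview: float, released: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration only.
    """
    absolute = preview - released
    relative = absolute / preview * 100
    return absolute, relative


absolute, relative = score_gap(O3_PREVIEW_LOW, O3_RELEASED_LOW)
print(f"Drop at low compute: {absolute} points ({relative:.1f}% relative)")
```

At the same compute setting, the released model gives up 35 percentage points, a bit under half of the preview's score.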

OpenAI confirmed to ARC that the production o3 model is not the same system as the preview version. According to the explanation, the released model has a different architecture, is smaller overall, works multimodally with text and image inputs, and uses fewer computational resources than the preview model.

The training data also differs. OpenAI states that o3-preview was trained on 75 percent of the ARC-AGI-1 dataset. The released o3 model, OpenAI says, was not trained directly on ARC-AGI data, not even on the training dataset. Indirect exposure is still possible, however, because the benchmark is publicly available.

Higher Reasoning Effort Does Not Always Pay Off

The analysis also raises a practical point for anyone comparing reasoning models: more compute is not automatically better. At the high reasoning level, both tested models failed to complete many tasks.

The Foundation observed that models tended to answer tasks they could solve more easily while leaving harder tasks unanswered. Counting only completed answers would make the systems look stronger than they really were, so those partial results were excluded from official leaderboards.
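Why counting only completed answers inflates a score can be shown with a small sketch. The numbers below are invented for illustration and are not taken from the ARC Prize data; the two function names are likewise hypothetical. Each task result is `True` (correct), `False` (wrong), or `None` (no answer returned).

```python
from typing import Optional


def accuracy_completed_only(results: list[Optional[bool]]) -> float:
    """Accuracy over answered tasks only (the misleading measure)."""
    answered = [r for r in results if r is not None]
    return sum(answered) / len(answered) if answered else 0.0


def accuracy_all_tasks(results: list[Optional[bool]]) -> float:
    """Accuracy over every task, counting unanswered tasks as failures."""
    return sum(1 for r in results if r) / len(results) if results else 0.0


# Hypothetical run: 100 tasks, 40 answered (30 correct, 10 wrong), 60 unanswered.
results: list[Optional[bool]] = [True] * 30 + [False] * 10 + [None] * 60

print(f"Completed only: {accuracy_completed_only(results):.0%}")
print(f"All tasks:      {accuracy_all_tasks(results):.0%}")
```

In this made-up run the model looks 75 percent accurate if unanswered tasks are ignored, but only 30 percent accurate across the full task set. If the unanswered tasks are disproportionately the hard ones, as the Foundation observed, the first number says little about real capability.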

For simpler tasks, o3-high used significantly more tokens without producing a matching accuracy gain. That matters because token use affects cost, speed, and deployment choices.

The ARC Prize Foundation's guidance is direct: for cost-sensitive applications, o3-medium is the recommended default. The high-reasoning mode is presented as useful only when maximum accuracy matters more than cost.

"There is no compelling reason to use low if you care about accuracy," says Mike Knoop, co-founder of the ARC Prize Foundation.

This shifts attention from raw scores to efficiency. The Foundation argues that as models improve, the key difference becomes how quickly and cheaply they solve problems, and with how few tokens.

o4-mini is important in that context. It reaches 21 percent accuracy on ARC-AGI-1 at about five cents per task. Older models such as o1-pro require roughly eleven dollars per task for comparable results.
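The cost difference is easier to appreciate as a ratio. The sketch below uses the approximate per-task prices quoted above, expressed in cents to avoid floating-point rounding. The helper `cost_per_solved_task` is a hypothetical name, and dividing cost by accuracy is a simplifying assumption that treats every task as costing the same.

```python
# Approximate per-task costs on ARC-AGI-1 as quoted in the article, in cents.
O4_MINI_COST_CENTS = 5      # "about five cents per task"
O1_PRO_COST_CENTS = 1100    # "roughly eleven dollars per task"
O4_MINI_ACCURACY = 0.21     # 21 percent at low compute


def cost_per_solved_task(cost_cents: float, accuracy: float) -> float:
    """Expected spend in cents per correctly solved task.

    Hypothetical helper; assumes uniform cost across tasks.
    """
    if accuracy <= 0:
        raise ValueError("accuracy must be positive")
    return cost_cents / accuracy


ratio = O1_PRO_COST_CENTS / O4_MINI_COST_CENTS
solved_cost = cost_per_solved_task(O4_MINI_COST_CENTS, O4_MINI_ACCURACY)

print(f"o1-pro costs about {ratio:.0f}x as much per task as o4-mini")
print(f"o4-mini: about {solved_cost:.1f} cents per solved task")
```

On these rounded figures, o1-pro is about 220 times as expensive per task, and o4-mini spends roughly 24 cents for each task it actually solves.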

ARC-AGI-2 Shows the Remaining Gap

The harder ARC-AGI-2 benchmark is where the limits become most visible. Both o3 and o4-mini scored below three percent accuracy on ARC-AGI-2.

That result contrasts sharply with human performance. Humans solve an average of 60 percent of ARC-AGI-2 tasks even without special training, while OpenAI's strongest reasoning model currently stays below three percent.

"ARC v2 has a long way to go still, even with the great reasoning efficiency of o3. New ideas are still needed," Knoop writes.

The takeaway is not that progress has stopped. The released o3 model still shows improvement over earlier chain-of-thought approaches on ARC-AGI-1. But ARC-AGI-2 suggests that present systems remain far from matching human-like problem-solving on this type of benchmark.

What the Results Mean for AI Claims

The analysis points to a broader lesson about AI evaluation. A benchmark result from an unreleased model may not describe the system that later appears in production. Product tuning, architecture changes, model size, multimodal capability, compute limits, and training data all affect performance.

The released o3 model has been refined for chat and product use cases. According to ARC Prize, that creates both advantages and disadvantages on ARC-AGI. A model can become more useful in one setting while losing ground on a narrow reasoning benchmark.

A separate recent analysis suggests that reasoning models such as o3 probably do not have new capabilities beyond those of their foundational language models. Instead, they may be optimized to reach correct solutions faster on certain tasks, especially tasks shaped by targeted reinforcement learning.

That framing helps explain why o3 can be impressive and limited at the same time. It can lead public ARC-AGI-1 results among tested models while still struggling badly on ARC-AGI-2. It can benefit from more reasoning effort in some cases, while in others high-compute runs become too costly or are left incomplete.

For users, the practical lesson is to treat AI reasoning benchmarks as evidence, not as final proof of general intelligence. For developers, the results point toward a harder question: whether current chain-of-thought approaches can scale efficiently, or whether new ideas are needed to close the gap ARC-AGI-2 exposes.