The Decoder, December 21, 2024

Why OpenAI o3 raises the bar for AI reasoning

OpenAI has announced o3, a reasoning model that posts major gains on benchmarks for AGI-style tasks, math, coding and science questions. The results are notable, but the model’s heavy computing needs and remaining weaknesses show why it is not yet artificial general intelligence.

OpenAI has introduced o3, a new AI model built for complex reasoning tasks. The model improves sharply on several demanding benchmarks, while also making clear that stronger reasoning comes with higher computing costs.

A more affordable o3 mini version is planned for late January 2025, with the full version expected after that. OpenAI is also opening safety testing before release and presenting a new approach called "Deliberative Alignment."

What makes o3 different

OpenAI’s o3 is designed to spend more time and computing power working through difficult problems. Like the recently released o1, it does not simply produce a quick answer. It uses a longer reasoning process to search for a solution.

That makes o3 important for tasks where the answer is not obvious from familiar patterns. The model marks a significant advance in problem-solving, especially when the challenge is unfamiliar or requires several steps.

François Chollet, who developed the ARC benchmark, describes o3’s performance as "a surprising and important step-function increase in AI capabilities." His explanation is that o3 appears to create new programs in real time to solve new problems, rather than mainly retrieving stored patterns.

Chollet compares the approach to Google DeepMind’s AlphaZero chess program. In that view, o3 works through possible solutions methodically until it finds a path that works. This helps explain both its stronger results and its heavy demand for computing resources.
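Chollet's description of searching through candidate solutions can be illustrated with a toy example. The sketch below is not o3's actual mechanism, which OpenAI has not published in detail. It is a minimal brute-force search, assuming a hypothetical set of four grid operations, that tries short sequences of those operations until one reproduces every example output. It also hints at why this style of search gets expensive: the number of candidate programs grows exponentially with program length.

```python
from itertools import product

# Hypothetical primitive grid operations, chosen only for illustration.
def identity(grid):
    return [list(row) for row in grid]

def flip_h(grid):
    return [list(row[::-1]) for row in grid]

def flip_v(grid):
    return [list(row) for row in grid[::-1]]

def transpose(grid):
    return [list(row) for row in zip(*grid)]

PRIMITIVES = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}

def run_program(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search_program(examples, max_length=3):
    """Return the first shortest program consistent with all examples, or None.

    Candidates are enumerated exhaustively, so the search space holds
    len(PRIMITIVES) ** length programs at each length.
    """
    for length in range(1, max_length + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run_program(program, inp) == out for inp, out in examples):
                return program
    return None

# Two input/output pairs that both show a 90-degree clockwise rotation.
examples = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[5, 6], [7, 8]], [[7, 5], [8, 6]]),
]

found = search_program(examples)
print("program found:", found)
print("applied to a new grid:", run_program(found, [[0, 1], [2, 3]]))
```

No single operation in this toy set rotates a grid, so the search has to compose two of them. That is the small-scale version of building a new program for an unfamiliar task rather than looking up a stored answer.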

The benchmark gains are broad

On the ARC-AGI benchmark, o3 reached 75.7 percent with standard computing power. With increased resources, that score rose to 87.5 percent. The benchmark is treated as one signal of progress toward artificial general intelligence, or AGI.

The model also performed strongly on Epoch AI's FrontierMath benchmark, which was introduced in November 2024 as one of the hardest available AI math tests. o3 achieved a 25.2 percent success rate, while previous models could not break 2 percent.

The benchmark’s developers called the result a "significant leap" and said they are preparing "tougher, next-generation benchmarks" for future AI systems.

Other tests show similar movement:

On the SWE-bench Verified software engineering benchmark, o3 reached 71.7 percent accuracy, more than 20 percentage points above o1.
In competitive programming, o3 achieved a Codeforces rating of 2727, surpassing the 2665 held by OpenAI's chief scientist.
On PhD-level science questions in the GPQA Diamond benchmark, o3 scored 87.7 percent. According to OpenAI, PhD experts average roughly 70 percent in their own fields.

Taken together, those numbers show a model that is stronger across reasoning, math, software and science evaluation. The gains are not limited to one narrow test.

Reasoning has a cost

The same reasoning process that improves o3 also makes it expensive to run. The system can process up to 33 million tokens for a single task, according to the source.

The high-efficiency version runs at about $20 per task. In testing, that came to $2,012 for 100 test tasks. The full set of 400 public tasks cost $6,677, or about $17 per task.
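The per-task figures follow directly from the reported totals. A minimal sketch of that arithmetic, using only the dollar amounts and task counts quoted above (the dictionary labels and the function name are illustrative):

```python
# Totals reported for the high-efficiency configuration: (total USD, task count).
reported_runs = {
    "100 test tasks": (2012, 100),
    "400 public tasks": (6677, 400),
}

def cost_per_task(total_usd, n_tasks):
    """Average cost per task in US dollars."""
    return total_usd / n_tasks

for label, (total_usd, n_tasks) in reported_runs.items():
    print(f"{label}: ${cost_per_task(total_usd, n_tasks):.2f} per task")
# 100 test tasks: $20.12 per task
# 400 public tasks: $16.69 per task
```

The two averages differ because they come from two separate task sets, which is why the article cites both "about $20" and "about $17".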

The low-efficiency version uses far more resources: roughly 172 times the computing power of the high-efficiency version. OpenAI has not disclosed exact costs for that version, but testing showed it processed between 33 and 111 million tokens and required about 1.3 minutes of computing time per task.
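Because the low-efficiency cost is undisclosed, any figure for it is an estimate. The sketch below shows one rough way to bound it, assuming cost scales linearly with compute. That linearity is an assumption made here for illustration, not something OpenAI or the benchmark organizers have confirmed, and the function name is hypothetical.

```python
# Reported inputs from the article.
HIGH_EFFICIENCY_TOTAL_USD = 2012   # total for 100 test tasks
HIGH_EFFICIENCY_TASKS = 100
COMPUTE_MULTIPLIER = 172           # low-efficiency vs high-efficiency compute

def estimate_scaled_cost(total_usd, n_tasks, multiplier):
    """Estimate per-task cost at a higher compute setting.

    ASSUMPTION: cost grows linearly with compute. The real pricing of the
    low-efficiency configuration has not been disclosed.
    """
    return (total_usd / n_tasks) * multiplier

per_task = estimate_scaled_cost(
    HIGH_EFFICIENCY_TOTAL_USD, HIGH_EFFICIENCY_TASKS, COMPUTE_MULTIPLIER
)
print(f"Estimated low-efficiency cost: ${per_task:,.2f} per task")
print(f"Estimated total for 100 tasks: ${per_task * HIGH_EFFICIENCY_TASKS:,.2f}")
```

Under that assumption, a single task would cost thousands of dollars rather than about $20, which gives a sense of scale even if the true number differs.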

That tradeoff matters. o3 shows how much better AI reasoning can look when a model is allowed to spend more compute on a problem. It also shows why the practical use of advanced reasoning models depends not only on raw capability, but also on cost, speed and where the model is being applied.

Why this is still not AGI

Despite the strong benchmark performance, Chollet says o3 is not artificial general intelligence. He notes that the model still fails at some very easy tasks, which points to fundamental differences from human intelligence.

Chollet’s test for true AGI is demanding: it would arrive only when people can no longer create tasks that are easy for humans but difficult for AI. By that standard, o3 is progress, not an endpoint.

That distinction becomes clearer with the next ARC test. Chollet has announced a more difficult successor for 2025. Early testing suggests o3 will score around 30 percent on ARC-AGI-2, while humans without special training can solve about 95 percent of its tasks.

This is the central tension around o3. It sets records on current benchmarks, but newer tests are already being prepared because the boundary keeps moving. Stronger benchmark scores can show progress, but they do not settle the broader question of general intelligence.

What to expect from o3 mini

OpenAI plans to release o3 mini in late January 2025, followed by the full version. The mini version is intended to be more affordable and will offer three speed settings: low, medium and high.

According to OpenAI, o3 mini outperforms o1 even at the medium setting, while being faster and more cost-effective. It also supports API features such as function calling and structured outputs, matching or exceeding o1's capabilities in those areas.

During a live demo, OpenAI showed o3 mini generating and executing code independently. One example involved creating a Python script that built a user interface for self-evaluation on a dataset.

Before release, OpenAI is launching a safety testing program, with applications open until January 10, 2025. The company is also introducing "Deliberative Alignment," a safety method that uses the model's reasoning abilities to establish better safety boundaries.

OpenAI also explained the name. The company skipped "o2" and went straight to "o3" out of consideration for the telecommunications company O2.