OpenAI's o3 model has become an early signal that AI performance may still have room to climb, even as older scaling methods show signs of slowing. Its benchmark results suggest that giving a model more compute after a prompt is submitted can improve answers, but the tradeoff is clear: better performance may arrive with a much larger bill.
A new path for AI scaling
Last month, AI founders and investors told TechCrunch that the field had entered a "second era of scaling laws." The point was not that AI progress had stopped. It was that familiar ways of improving models were producing smaller returns than before.
One approach drawing attention is test-time scaling. In plain terms, this means using more compute during inference, the stage after a user presses enter and before the system returns an answer. OpenAI's o3 appears to be an example of that approach.
The model's announcement led many in the AI world to argue that progress has not "hit a wall." The reason is performance: o3 significantly outscored other models on ARC-AGI, a benchmark used to measure general ability, and it scored 25% on a difficult math test where no other AI model scored more than 2%.
There is still a major caveat. Very few people have tried o3 so far, and TechCrunch noted that it was withholding full judgment until it could test the model directly. Even so, the announcement has already shifted how many observers talk about the next phase of AI development.
What test-time scaling changes
The central idea behind test-time scaling is simple but important. Instead of relying only on what happened during pre-training, a model can spend more computation on the answer itself.
The exact mechanics behind o3 are not public. OpenAI may be using more computer chips to answer a question, running more powerful inference chips, or letting those chips run longer before producing a response. In some cases, TechCrunch reported, that could mean waiting 10 to 15 minutes for an answer.
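Because o3's mechanics are not public, the following is only a generic illustration of the idea of spending more compute at answer time, not OpenAI's method. It is a minimal Python sketch of majority voting over repeated samples, one publicly known test-time technique. The solver, its 40% success rate, and all function names are hypothetical stand-ins.

```python
import random
from collections import Counter

CORRECT_ANSWER = 42  # hypothetical ground truth for a single toy question


def noisy_solver(rng: random.Random, p_correct: float = 0.4) -> int:
    """Hypothetical stand-in for one model sample: right 40% of the time."""
    if rng.random() < p_correct:
        return CORRECT_ANSWER
    # Wrong answers are spread across several distractors.
    return rng.choice([7, 13, 21, 99, 100])


def answer_with_budget(n_samples: int, seed: int = 0) -> int:
    """Spend more compute (more samples), then return the most common answer."""
    rng = random.Random(seed)
    votes = Counter(noisy_solver(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 500) -> float:
    """Fraction of independent trials in which the voted answer is correct."""
    hits = sum(
        answer_with_budget(n_samples, seed=t) == CORRECT_ANSWER
        for t in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    for budget in (1, 5, 25):
        print(f"{budget:>2} samples per question: accuracy {accuracy(budget):.2f}")
```

In this toy setup, each extra sample costs the same as the first, so accuracy and the bill rise together. That is the tradeoff the rest of the article describes.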
That makes o3 different from a model designed to answer everyday questions quickly. A high-compute reasoning model may be more useful for difficult prompts where the value of a better answer outweighs the wait and the cost.
Noam Brown, co-creator of OpenAI's o-series of models, highlighted the speed of the jump from o1 to o3. OpenAI announced o1 just three months before o3. Brown wrote, "We have every reason to believe this trajectory will continue."
Anthropic co-founder Jack Clark also framed o3 as evidence that AI "progress will be faster in 2025 than in 2024." He suggested the field may combine test-time scaling with traditional pre-training scaling to get more performance from future AI models.
The benchmark gains are expensive
The strongest case for o3 comes from its benchmark results, but those results also show the cost problem. On ARC-AGI, o3 scored 88% in one of its attempts, while OpenAI's next best AI model, o1, scored 32%.
That gap is large. But the high-scoring version of o3 used more than $1,000 worth of compute for every task. By comparison, the o1 models used around $5 of compute per task, and o1-mini used just a few cents.
François Chollet, the creator of the ARC-AGI benchmark, wrote that OpenAI used roughly 170x more compute to reach the 88% score than it did for a high-efficiency version of o3 that scored about 12 percentage points lower. The high-scoring version used more than $10,000 of resources to complete the test, making it too expensive to compete for the ARC Prize.
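The reported figures can be put side by side with a little arithmetic. The Python sketch below uses only the numbers quoted above. Several of them are rounded bounds ("more than", "around"), so the results are approximate, and the reading of the 12% gap as percentage points is an assumption.

```python
# Figures as reported in the article; several are rounded bounds ("more than",
# "around"), so every result below is approximate.
O3_HIGH_COST_PER_TASK = 1_000.0  # dollars, "more than $1,000" per task
O1_COST_PER_TASK = 5.0           # dollars, "around $5" per task
O3_HIGH_SCORE = 88.0             # percent on ARC-AGI
O1_SCORE = 32.0                  # percent on ARC-AGI
COMPUTE_MULTIPLE = 170           # high-compute o3 vs. high-efficiency o3
SCORE_GAP = 12.0                 # assumed to mean percentage points


def cost_multiple(expensive: float, cheap: float) -> float:
    """How many times more one configuration costs per task than another."""
    return expensive / cheap


def points_gained(high: float, low: float) -> float:
    """Difference between two benchmark scores, in percentage points."""
    return high - low


if __name__ == "__main__":
    vs_o1 = cost_multiple(O3_HIGH_COST_PER_TASK, O1_COST_PER_TASK)
    gain = points_gained(O3_HIGH_SCORE, O1_SCORE)
    print(f"o3 (high compute) vs o1: at least {vs_o1:.0f}x the cost "
          f"for {gain:.0f} more points")

    efficient_score = O3_HIGH_SCORE - SCORE_GAP
    print(f"High-efficiency o3: about {efficient_score:.0f}% "
          f"at roughly 1/{COMPUTE_MULTIPLE} of the compute")
```

On these figures, the high-compute run costs at least 200 times as much per task as o1 for a gain of 56 points.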
Chollet still described o3 as a breakthrough for AI models. He wrote, "o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain." He also stressed the economic side of the result, noting that the same generality "comes at a steep cost."
This matters because inference costs become less predictable when a model can spend more compute to improve an answer. Previously, providers could estimate serving costs from the size of the model and the length of its output. With test-time compute, the cost may depend more heavily on how much effort the system applies to a specific problem.
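A rough way to see the predictability problem is to model the cost of a request. This is a minimal Python sketch under stated assumptions: the price, the notion of hidden "reasoning tokens", and the number of parallel attempts are all hypothetical and do not describe any provider's real billing.

```python
from dataclasses import dataclass

PRICE_PER_1K_TOKENS = 0.01  # hypothetical price in dollars, not a real rate


@dataclass
class Request:
    output_tokens: int         # tokens the user actually sees
    reasoning_tokens: int = 0  # hypothetical hidden tokens spent "thinking"
    samples: int = 1           # hypothetical number of parallel attempts


def serving_cost(req: Request) -> float:
    """Cost in dollars: every attempt pays for its reasoning and its output."""
    total_tokens = req.samples * (req.reasoning_tokens + req.output_tokens)
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS


if __name__ == "__main__":
    # Both requests return an answer of the same visible length.
    easy = Request(output_tokens=500)
    hard = Request(output_tokens=500, reasoning_tokens=50_000, samples=8)
    print(f"easy prompt: ${serving_cost(easy):.3f}")
    print(f"hard prompt: ${serving_cost(hard):.2f}")
```

Both requests return an answer of the same visible length, yet under these assumed numbers the hard prompt costs about 800 times as much. The visible output alone no longer predicts the bill.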
Who would actually use o3?
The cost profile raises a practical question: what is o3 for? TechCrunch argues that o3 and its successors may not be a "daily driver" in the way GPT-4o or Google Search might be.
For routine questions, heavy inference spending may make little sense. The better fit could be large, consequential prompts where a more capable answer is worth a much higher compute cost.
That points toward users with high-value problems and large budgets. Wharton professor Ethan Mollick wrote that o3 looks too expensive for most use, but could still make sense in academia, finance and many industrial problems if the system is generally reliable.
OpenAI has already released a $200-a-month tier that offers a high-compute version of o1. The startup has also reportedly weighed subscription plans costing up to $2,000 a month. Given o3's compute demands, those higher prices become easier to understand.
Still, a higher price does not remove the risk of wrong answers. Chollet noted that o3 is not AGI and still fails on some easy tasks that a human would handle without difficulty. TechCrunch also points out that large language models still have a huge hallucination problem, and test-time compute does not appear to have solved it.
The next bottleneck may be compute itself
If test-time scaling becomes a major path for AI improvement, inference hardware will matter even more. Better AI inference chips could help unlock more gains or reduce the cost of applying extra compute at answer time.
TechCrunch names Groq and Cerebras as startups working on inference chips, and MatX as one designing more cost-efficient AI chips. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch he expects these startups to play a bigger role in test-time scaling moving forward.
For now, o3 offers two lessons at once. It suggests that AI models can still improve through new scaling strategies. It also shows that the next stage of AI progress may be shaped as much by economics as by benchmark scores.