Why GPT-4.5 makes OpenAI's scaling bet look more complicated

OpenAI has launched GPT-4.5, code-named Orion, as its largest AI model to date. The model shows stronger factual accuracy and more natural responses in some areas, but its high cost and mixed benchmark results raise questions about how far traditional AI scaling can go.

OpenAI's GPT-4.5 arrives with a clear headline achievement: it is the company's largest model to date. But the more important story is what that size does, and does not, deliver.

The model, code-named Orion, was announced on Thursday and is being released first as a research preview. It extends the same pre-training approach behind GPT-4, GPT-3, GPT-2, and GPT-1, using more computing power and data than OpenAI's previous releases. Yet the launch also shows why the AI industry is watching scaling with more caution than certainty.

A bigger model with limited early access

GPT-4.5 is initially available to subscribers of ChatGPT Pro, OpenAI's $200-a-month plan, starting Thursday. Developers on paid tiers of OpenAI's API can also use GPT-4.5 from the same day.

OpenAI told TechCrunch that ChatGPT Plus and ChatGPT Team users should get access sometime next week. That staged rollout fits the way OpenAI is presenting the model: not as a universal replacement, but as a research preview meant to expose strengths, weaknesses, and unexpected uses.

OpenAI described the release this way: "We're sharing GPT-4.5 as a research preview to better understand its strengths and limitations," adding, "We're still exploring what it's capable of and are eager to see how people use it in ways we might not have expected."

There was also an unusual shift after release. Hours after GPT-4.5 became available, OpenAI removed a line from the model's white paper that said "GPT-4.5 is not a frontier AI model." The newer white paper no longer includes that sentence.

What GPT-4.5 appears to do well

OpenAI says GPT-4.5 has "a deeper world knowledge" and "higher emotional intelligence." The company also argues that the model is better in areas that ordinary benchmarks may not fully capture, including understanding human intent and responding in a warmer, more natural tone.

On OpenAI's SimpleQA benchmark, which tests straightforward factual questions, GPT-4.5 beats GPT-4o and OpenAI's reasoning models, o1 and o3-mini, on accuracy. OpenAI also says GPT-4.5 hallucinates less often than most models, which would make it less likely to generate false information in factual tasks.

The model also supports file and image uploads and ChatGPT's canvas tool. Those capabilities make it useful for some familiar ChatGPT workflows, even though OpenAI is not positioning it as a drop-in replacement for GPT-4o.

OpenAI highlighted informal examples as well. In one test, GPT-4.5, GPT-4o, and o3-mini were asked to create a unicorn in SVG. GPT-4.5 was the only model that produced something resembling a unicorn. In another test, the models responded to the prompt, "I'm going through a tough time after failing a test." GPT-4o and o3-mini were helpful, but GPT-4.5 gave the most socially appropriate response.

Where the benchmark picture is mixed

The strongest case for GPT-4.5 is not that it wins everywhere. It does not. Instead, the model looks competitive in some non-reasoning tasks while falling behind newer reasoning models on harder benchmarks.

On the SWE-Bench Verified benchmark, a human-validated subset of the SWE-bench coding problems, GPT-4.5 roughly matches GPT-4o and o3-mini. But it falls short of OpenAI's deep research and Anthropic's Claude 3.7 Sonnet. On OpenAI's SWE-Lancer benchmark, which evaluates the ability to develop full software features, GPT-4.5 beats GPT-4o and o3-mini, but still does not beat deep research.

The same pattern appears in difficult academic tests such as AIME and GPQA. GPT-4.5 does not quite match leading AI reasoning models including o3-mini, DeepSeek's R1, and Claude 3.7 Sonnet. At the same time, it matches or outperforms leading non-reasoning models on those tests, suggesting that it remains strong in math- and science-related tasks compared with its closest category peers.

OpenAI's own framing acknowledges this gap between benchmark rankings and real-world value. The company wrote, "[W]e look forward to gaining a more complete picture of GPT-4.5's capabilities through this release," because "academic benchmarks don't always reflect real-world usefulness."

The cost problem is hard to ignore

GPT-4.5 is not just large. It is also expensive to run, so much so that OpenAI says it is evaluating whether to continue serving GPT-4.5 in its API over the long term.

The API pricing makes the cost difference clear. GPT-4.5 costs developers $75 for every million input tokens, described as roughly 750,000 words, and $150 for every million output tokens. GPT-4o costs $2.50 per million input tokens and $10 per million output tokens.
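To put those list prices in perspective, here is a minimal Python sketch of the arithmetic. The per-million-token prices are the figures quoted above. The dictionary keys are plain labels rather than official API model identifiers, and the 2,000-token prompt and 500-token reply are a hypothetical workload chosen only for illustration.

```python
# Per-million-token list prices in USD, as quoted in this article.
# The keys are informal labels, not official API model identifiers.
PRICES_PER_MILLION = {
    "gpt-4.5": {"input": 75.00, "output": 150.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted list prices."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def price_ratio(kind: str) -> float:
    """Return how many times more GPT-4.5 costs than GPT-4o per token."""
    return PRICES_PER_MILLION["gpt-4.5"][kind] / PRICES_PER_MILLION["gpt-4o"][kind]


if __name__ == "__main__":
    # Hypothetical workload: a 2,000-token prompt and a 500-token reply.
    for model in PRICES_PER_MILLION:
        print(f"{model}: ${request_cost(model, 2_000, 500):.4f} per request")
    print(f"input price ratio:  {price_ratio('input'):.0f}x")   # 30x
    print(f"output price ratio: {price_ratio('output'):.0f}x")  # 15x
```

At these prices GPT-4.5 is 30 times more expensive than GPT-4o on input tokens and 15 times more expensive on output tokens, so the overall multiple for a given application depends on its mix of prompt and response length.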

That gap matters because GPT-4o remains OpenAI's workhorse model for much of ChatGPT and the API. GPT-4.5 also lacks some capabilities that GPT-4o users may expect, including support for ChatGPT's realistic two-way voice mode.

For developers, the question is not only whether GPT-4.5 is better in certain tasks. It is whether those gains justify the much higher price, especially when the model does not clearly dominate reasoning systems on many difficult evaluations.

Why GPT-4.5 matters beyond one launch

GPT-4.5 is important because it tests a central assumption in modern AI development: that adding more data and computing power during unsupervised pre-training will keep producing major leaps in capability.

OpenAI says GPT-4.5 is "at the frontier of what is possible in unsupervised learning." But the model's limits also point toward the pressure on that approach. Earlier GPT generations saw major gains from scaling across areas such as mathematics, writing, and coding. With GPT-4.5, those gains appear more uneven.

The industry has already been moving toward reasoning models, which take longer to complete tasks but can be more consistent. These systems use more time and computing power while working through problems, and AI labs believe that can improve capabilities in ways traditional pre-training may not.

OpenAI plans to combine its GPT series with its "o" reasoning series, beginning with GPT-5 later this year. In that context, GPT-4.5 may be less of a final answer and more of a bridge. It shows what a very large non-reasoning model can still do well, while also showing why the next stage of AI progress may depend on more than size alone.