Ars Technica | AI | February 28, 2025

Why GPT-4.5 makes AI progress look more expensive

OpenAI’s GPT-4.5 is presented as a large, compute-heavy research preview rather than a replacement for GPT-4o. The model shows modest gains in some areas, but its price, coding results and weaker reasoning benchmarks make its value harder to defend.

GPT-4.5 arrives with a complicated message: it is OpenAI’s newest and most capable traditional AI model, but it is not being framed as the obvious next default. The model is bigger, slower and more expensive than GPT-4o, while many of its measured gains appear incremental rather than transformative.

That tension matters because GPT-4.5 is being judged not only as a product, but also as a signal about where large language models may be heading. If a much larger model delivers only modest improvements at sharply higher cost, the economics of simply scaling traditional models become harder to ignore.

A larger model with a narrower pitch

OpenAI launched GPT-4.5 as a relatively low-key “Research Preview” for ChatGPT Pro users. The company also made clear that the model is not intended to replace GPT-4o.

In its release post, OpenAI wrote: “GPT-4.5 is a very large and compute-intensive model, making it more expensive than and not a replacement for GPT-4o,” adding that it is evaluating whether to keep serving the model in the API long-term.

That language is unusual for a model release because it openly flags cost and long-term availability as unresolved questions. For users, the message is straightforward: GPT-4.5 may be useful in some situations, but OpenAI is not presenting it as the universal upgrade path.

The model is now available to ChatGPT Pro subscribers. Rollout to Plus and Team subscribers is planned for next week, followed by Enterprise and Education customers the week after. Developers can also access GPT-4.5 through OpenAI’s APIs on paid tiers, though the company has not committed to long-term API availability.

The cost problem is central

The biggest challenge for GPT-4.5 is not that it fails everywhere. It is that its improvements have to be measured against a much higher price.

Through the API, GPT-4.5 costs $75 per million input tokens and $150 per million output tokens. GPT-4o costs $2.50 per million input tokens and $10 per million output tokens. That makes GPT-4.5 30x the cost for input and 15x the cost for output compared with GPT-4o.
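The multipliers above follow directly from the quoted list prices. As a minimal sketch of that arithmetic (the `PRICES` table and `price_ratio` helper are illustrative names, not part of any OpenAI tooling):

```python
# Per-million-token API prices in USD, as quoted in the article.
PRICES = {
    "GPT-4.5": {"input": 75.00, "output": 150.00},
    "GPT-4o": {"input": 2.50, "output": 10.00},
}


def price_ratio(model: str, baseline: str, kind: str) -> float:
    """Return how many times more `model` costs than `baseline` per token of `kind`."""
    return PRICES[model][kind] / PRICES[baseline][kind]


input_ratio = price_ratio("GPT-4.5", "GPT-4o", "input")
output_ratio = price_ratio("GPT-4.5", "GPT-4o", "output")
print(f"input: {input_ratio:.0f}x, output: {output_ratio:.0f}x")
# prints "input: 30x, output: 15x"
```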

Against OpenAI’s reasoning models, the comparison is also difficult. The o1 pro model costs $15 per million input tokens and $60 per million output tokens. The o3-mini model costs $1.10 per million input tokens and $4.40 per million output tokens.

For developers, that pricing changes the decision from “which model is best?” to “which model is good enough for the task?” If GPT-4o already performs adequately in an application, GPT-4.5 needs a clear advantage to justify the additional spend.

GPT-4.5: $75 per million input tokens and $150 per million output tokens.
GPT-4o: $2.50 per million input tokens and $10 per million output tokens.
o1 pro: $15 per million input tokens and $60 per million output tokens.
o3-mini: $1.10 per million input tokens and $4.40 per million output tokens.
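To make the "good enough for the task" question concrete, the listed prices can be applied to a workload. The sketch below assumes a purely hypothetical monthly volume of 10 million input tokens and 2 million output tokens; the function and variable names are illustrative. It also assumes every model would consume the same token counts, which is a simplification: reasoning models may bill additional hidden reasoning tokens as output.

```python
# (input price, output price) in USD per million tokens, as quoted in the article.
PRICES_PER_MILLION = {
    "GPT-4.5": (75.00, 150.00),
    "GPT-4o": (2.50, 10.00),
    "o1 pro": (15.00, 60.00),
    "o3-mini": (1.10, 4.40),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a workload at the listed per-million-token prices."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical monthly volume, chosen only for illustration.
INPUT_TOKENS = 10_000_000
OUTPUT_TOKENS = 2_000_000

for name in PRICES_PER_MILLION:
    cost = workload_cost(name, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{name:8s} ${cost:,.2f}")
```

Under these assumptions the same workload comes to $1,050 on GPT-4.5 versus $45 on GPT-4o, so the blended multiplier depends on the mix of input and output tokens rather than being a flat 30x or 15x.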

Better in some ways, weaker in others

OpenAI’s own benchmark results show that GPT-4.5 can improve on GPT-4o in certain areas. On the multilingual MMMLU general knowledge test, GPT-4.5 scored 85.1 percent, compared with GPT-4o’s 81.5 percent. OpenAI also says GPT-4.5 reduces confabulations, meaning it produces fewer false or misleading responses than earlier versions.

Human evaluators also preferred GPT-4.5’s responses over GPT-4o’s in about 57 percent of interactions. That suggests a measurable user-experience improvement, even if the gain is not dramatic.

Former OpenAI researcher Andrej Karpathy described the difference as real but hard to summarize. He wrote that GPT-4.5 is better than GPT-4o in subtle ways: “Everything is a little bit better and it’s awesome,” while also saying those improvements are not easy to point to.

But the model looks less impressive on reasoning-heavy tests. According to OpenAI’s benchmark results, GPT-4.5 scored 36.7 percent on the AIME math competition test, while o3-mini scored 87.3 percent. More broadly, GPT-4.5 scored significantly lower than OpenAI’s simulated reasoning models, o1 and o3, on both AIME and the GPQA science assessment.

OpenAI CEO Sam Altman also set expectations around this point. He wrote that GPT-4.5 is not a reasoning model and would not dominate benchmarks, describing it instead as a different kind of intelligence.

Coding exposes the tradeoff

Coding appears to be one of GPT-4.5’s weakest areas relative to its price. The model has an October 2023 knowledge cutoff, so it may be unaware of more recent updates to development frameworks.

Tech investor Paul Gauthier tested GPT-4.5 with Aider’s Polyglot Coding benchmark. In that testing, GPT-4.5 ranked 10th in overall coding ability. Claude 3.7 Sonnet with extended thinking ranked at the top, with o1 and o3 also ahead.

GPT-4.5 also ranked poorly on performance versus cost in that benchmark. For developers using an API, that combination is important. A model can be technically capable and still be a poor fit if another model produces stronger coding results at a lower price.

This is why the early reaction has been mixed. An AI expert who requested anonymity told Ars Technica, “GPT-4.5 is a lemon!” Gary Marcus called the release a “nothing burger.” Those reactions focus less on whether GPT-4.5 has any improvements and more on whether those improvements justify the cost and compute required.

A possible end point for traditional scaling

GPT-4.5 also lands in the middle of a larger debate about diminishing returns in large language models trained with unsupervised learning. Ars Technica’s report argues that the model may support long-running concerns that scaling laws have reached a natural limit for this approach.

OpenAI spent much of last year working on simulated reasoning models such as o1 and o3. These models improve performance by spending more compute at inference time, rather than relying only on ever-larger training runs for GPT-style models.

Altman previously wrote that GPT-4.5 will be the last of OpenAI’s traditional AI models. GPT-5 is planned as a dynamic combination of “non-reasoning” LLMs and simulated reasoning models like o3.

The immediate takeaway is not that GPT-4.5 is useless. It shows improvements in multilingual knowledge, fewer confabulations and a user preference over GPT-4o. But it also shows how hard it is to turn a larger traditional model into a broadly better product when reasoning models and competitors are changing the performance and cost equation.

For now, GPT-4.5 looks less like a clean leap forward and more like a transition point: a model that may be impressive in feel, expensive in practice and important mainly because of what it says about the next phase of AI model design.