MIT Tech Review AI September 17, 2024 TERMINATOR

OpenAI o1 pushes AI from fluent answers toward reasoning

OpenAI o1 is built primarily for multistep reasoning rather than language-heavy tasks. Its early results on coding problems, math olympiad questions, and PhD-level science questions suggest a shift in what large language models may be expected to do, though experts warn that measuring reasoning remains difficult.

WTF Index TERMINATOR

Terminator 2, Idiocracy 0

The story mainly highlights AI becoming more capable at autonomous multistep reasoning, without clear evidence of direct danger or social degradation.

OpenAI’s o1 model marks a notable turn in the development of large language models. The point is not simply that it can produce smoother text, but that it is designed to work through harder, multistep problems in areas such as advanced mathematics, coding, and other STEM-based questions.

That distinction matters because much of the progress in LLMs has so far been driven by language. Chatbots and voice assistants have become better at interpreting, analyzing, and generating words, while still struggling with tasks that require careful constraints, correction, and step-by-step problem solving.

What makes OpenAI o1 different

OpenAI o1, previously referred to under the code name “Strawberry” and, before that, Q*, is focused on multistep “reasoning.” According to OpenAI, it uses a “chain of thought” technique that helps the model work through difficult tasks in a more deliberate way.

OpenAI described the process this way in a blog post on its website: “It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.”

That is a different emphasis from models that are mainly suited to writing, editing, and other language tasks. The source article gives a simple example: GPT-4o, OpenAI’s leading model at the time, repeatedly failed to create a short wedding-themed poem under strict letter-count constraints. It could count letters after the fact, but it still produced poems that did not fit the prompt.

OpenAI o1 performed better on that kind of constrained task, though it was still not perfect. The broader point is that reasoning-heavy models are being aimed at problems where getting the structure right matters as much as producing fluent language.
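The gap between counting letters after the fact and meeting a letter-count constraint while writing can be made concrete with a small checker. The sketch below is illustrative only: the source does not give the exact prompt, so the per-line letter targets, the sample lines, and the function names are all hypothetical.

```python
def count_letters(line: str) -> int:
    """Count alphabetic characters only, ignoring spaces and punctuation."""
    return sum(1 for ch in line if ch.isalpha())


def check_poem(lines: list[str], targets: list[int]) -> list[tuple[int, int, int]]:
    """Return (line_index, expected, actual) for each line that misses its target."""
    if len(lines) != len(targets):
        raise ValueError("poem must have exactly one line per target")
    violations = []
    for index, (line, expected) in enumerate(zip(lines, targets)):
        actual = count_letters(line)
        if actual != expected:
            violations.append((index, expected, actual))
    return violations


# Hypothetical two-line poem with an assumed target of 15 letters per line.
poem = ["Two hearts, one vow", "Rings shine bright"]
targets = [15, 15]
violations = check_poem(poem, targets)
print(violations)  # the second line has 16 letters, so it is reported
```

Verifying a finished poem is this mechanical. The harder part, and the one the example in the source article turns on, is producing lines that pass the check in the first place.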

The benchmark results are the headline

OpenAI’s tests show large gains in areas where careful reasoning is central. The model ranks in the 89th percentile on questions from the competitive coding organization Codeforces. It would also be among the top 500 high school students in the USA Math Olympiad, which includes geometry, number theory, and other math topics.

On math olympiad questions, OpenAI o1 reached 83.3% accuracy, compared with 13.4% for GPT-4o.

The model has also been trained to answer PhD-level questions in subjects ranging from astrophysics to organic chemistry. On those questions, it averaged 78% accuracy, compared with 69.7% for human experts and 56.1% for GPT-4o.

Those numbers help explain why OpenAI o1 is drawing attention. They suggest that large language models may be moving beyond text generation and into territory where they can support more technical problem solving.

Why reasoning changes the stakes

The source article frames OpenAI o1 as one of the first signs that LLMs might soon become genuinely useful companions to human researchers in fields such as drug discovery, materials science, coding, and physics. Those are areas where a model has to do more than summarize information or produce polished prose.

A reasoning model could matter because many important tasks require breaking a problem into parts, checking intermediate steps, and changing direction when a path fails. That is the promise behind chain-of-thought reasoning in an AI model.

Matt Welsh, an AI researcher and founder of the LLM startup Fixie, said the model brings this capability to a mass audience. “The reasoning abilities are directly in the model, rather than one having to use separate tools to achieve similar results. My expectation is that it will raise the bar for what people expect AI models to be able to do,” Welsh said.

That expectation shift may be as important as any single benchmark. If users begin to expect AI models to solve technical problems, not just answer questions or draft text, the standards for leading models will change.

The limits still matter

The source article also makes clear that OpenAI o1 should not be treated as proof that AI has reached human-level reasoning. Yves-Alexandre de Montjoye, an associate professor in math and computer science at Imperial College London, warned that comparisons to “human-level skills” should be taken with a grain of salt.

One reason is that it is hard to compare how LLMs and people solve math problems from scratch. A correct answer does not automatically prove that a model reasoned in the same way a person would.

AI researchers also say that measuring reasoning is harder than it may appear. If a model answers correctly, it may have reasoned through the problem, or it may have benefited from a strong base of built-in knowledge. Google AI researcher François Chollet wrote on X that the model “still falls short when it comes to open-ended reasoning.”

These cautions do not erase the benchmark gains. They do, however, keep the claims in perspective. OpenAI o1 appears to be a major step in reasoning-heavy AI, but the boundaries of that capability are still being tested.

Cost and use cases will shape adoption

OpenAI o1 is also more expensive for developers using the API. The source article states that developers will pay three times as much as they pay for GPT-4o: $15 per 1 million input tokens for o1, versus $5 for GPT-4o.
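The threefold figure follows directly from the two per-million-token prices quoted above. The sketch below works through the arithmetic, assuming only input tokens are billed (the source gives no output-token prices) and using a hypothetical workload size; the dictionary keys and function name are illustrative, not API identifiers.

```python
# Input-token prices in USD per 1 million tokens, as quoted in the source article.
PRICE_PER_MILLION_INPUT_USD = {"o1": 15.0, "gpt-4o": 5.0}


def input_cost_usd(model: str, input_tokens: int) -> float:
    """Cost of the input tokens alone; output tokens are not modeled here."""
    return PRICE_PER_MILLION_INPUT_USD[model] * input_tokens / 1_000_000


tokens = 2_000_000  # hypothetical workload, chosen only for illustration
o1_cost = input_cost_usd("o1", tokens)
gpt4o_cost = input_cost_usd("gpt-4o", tokens)

print(f"o1: ${o1_cost:.2f}, GPT-4o: ${gpt4o_cost:.2f}, ratio: {o1_cost / gpt4o_cost:.1f}x")
```

Because both prices scale linearly with token count, the ratio stays at three regardless of workload size. The absolute difference is what grows with volume.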

That price difference matters because not every task needs a reasoning-heavy model. According to OpenAI’s user surveys, GPT-4o remains the better option for more language-heavy tasks.

For now, OpenAI o1 looks most relevant when the task requires structured problem solving, complex coding, advanced mathematics, or STEM-based reasoning. For writing and editing, the older model may still be the practical choice.

The real test will come as researchers and labs get enough access, time, and budget to explore what the model can and cannot do. The source article’s conclusion is cautious but significant: the race for models that can outreason humans has begun.