OpenAI's reported Strawberry project points to a central question in AI development: can models move from producing fluent answers to planning, researching, and solving difficult problems over longer stretches of work?
According to Reuters reporting cited by the source article, Strawberry is an internal OpenAI effort aimed at improving reasoning. A later Reuters update added that OpenAI internally tested an AI system that scored over 90 percent on the MATH benchmark, although Reuters could not confirm whether that system was Strawberry.
A reported push beyond ordinary answers
Strawberry is described as an OpenAI project designed to strengthen the reasoning abilities of the company's AI models. The project was previously known as Q* or Q-Star, according to the Reuters report referenced in the source.
The reported aim is not only to make an AI system respond more accurately but also to help it plan ahead. Internal OpenAI documents reviewed by Reuters reportedly describe using Strawberry models for autonomous web searches. The source article says this capability is referred to as "deep research" and could allow an AI to "navigate the web autonomously."
That distinction matters. A chatbot can answer a prompt in one step. A research-oriented AI agent needs to break a task into parts, look for information, decide what matters, and keep working toward a goal. The source describes Strawberry as part of that broader move toward systems that can reason first and then act.
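To make the contrast concrete, here is a minimal sketch of the plan, search, and filter loop described above. It is a toy illustration only: the class `ResearchAgent`, its methods, and the dictionary standing in for the web are all hypothetical, and nothing here reflects how Strawberry is actually built.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ResearchAgent:
    """Toy plan-search-filter loop. Every name here is hypothetical."""

    corpus: dict[str, str]  # stand-in for the web: query -> snippet
    notes: list[str] = field(default_factory=list)

    def plan(self, goal: str) -> list[str]:
        # Break the task into parts. A real system would ask a model to do this.
        return [part.strip() for part in goal.split(" and ")]

    def search(self, query: str) -> str | None:
        # Look for information. Returns None when nothing is found.
        return self.corpus.get(query)

    def run(self, goal: str, max_steps: int = 10) -> list[str]:
        pending = self.plan(goal)
        steps = 0
        # Keep working toward the goal until the plan is done or the budget runs out.
        while pending and steps < max_steps:
            query = pending.pop(0)
            result = self.search(query)
            if result is not None:  # decide what matters: keep only hits
                self.notes.append(f"{query}: {result}")
            steps += 1
        return self.notes


agent = ResearchAgent(corpus={"MATH benchmark": "competition problems"})
print(agent.run("MATH benchmark and unknown topic"))
```

A one-step chatbot corresponds to a single `search` call. The loop in `run` is what turns that into multi-step work, and the `max_steps` budget is an assumption added here so the toy always terminates.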
What the MATH benchmark update suggests
The July 15, 2024 update in the source article says another Reuters source reported an internal OpenAI test of an AI that scored over 90 percent on the MATH benchmark. The MATH dataset, or Mathematics Aptitude Test of Heuristics, is used to measure how AI systems perform on complex mathematical problems.
The benchmark contains problems from math competitions for high school and college students. In the comparison given by the source article, the original GPT-4 scored around 53 percent, while GPT-4o scored 76.6 percent.
A result above 90 percent would mean the tested AI solved more than nine in ten of those difficult problems correctly. The source article frames that as a sign of advanced mathematical ability and possibly stronger reasoning, with one important caution: the interpretation holds only if the problems were not simply memorized.
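For readers unfamiliar with how such a score is produced, the sketch below shows the basic arithmetic: the share of problems whose final answer matches the reference. The function name and the four sample answers are made up for illustration and are not real MATH items. Real evaluation setups generally also need to recognize equivalent answer forms, which this toy omits.

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of problems whose final answer matches the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical answers to four made-up problems.
references = ["12", "3/4", "x=2", "7"]
predictions = ["12", "3/4", "x=5", "7"]
print(f"{accuracy(predictions, references):.1%}")  # 75.0%
```

Seen this way, moving from GPT-4o's 76.6 percent to a score above 90 percent would cut the error rate from 23.4 percent to under 10 percent.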
Reuters could not confirm whether the tested AI was Strawberry, so the result calls for caution. The benchmark result may be related to the same line of work, but the source does not establish that connection as fact.
How Strawberry may work
The exact technical details of Strawberry remain unknown, according to the source article. One Reuters insider said the project uses a special form of "post-training," where pre-trained models are adapted for specific tasks.
The source article also says the process involves a "deep research" dataset. Beyond that, the public picture is limited. The important reported idea is that Strawberry is not described as a completely separate kind of AI model, but as a way to adapt models for more demanding reasoning and research behavior.
OpenAI is reportedly targeting long-horizon tasks, or LHT. These are complex tasks that require planning and execution over extended periods rather than a single short response. The source article says Strawberry systems would be assisted by a "CUA," described as a computer-using agent that can independently perform actions based on the AI's results.
That setup would make the AI less like a passive text generator and more like an agent capable of taking steps. According to the Reuters source cited by the article, Strawberry is specifically being tested to take over tasks from software and machine learning engineers.
Why STaR and Quiet-STaR are relevant
The source article says OpenAI's approach is similar to a method from Stanford researchers called "Self-Taught Reasoner" (STaR), according to a Reuters source. STaR is described as a way to improve logical reasoning by teaching AI systems to read between the lines.
Quiet-STaR, an extension of STaR presented in March 2024, trains language models to generate possible rationales for what comes next at each point in a text. Through trial and error, the AI learns which rationales produce better results. The source article notes that the longer the system can reason, the better the outcomes.
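The trial-and-error idea can be illustrated with a toy version of the filtering step in the published STaR method: sample rationales, and keep only those that end in the correct answer as training material. This is a sketch under heavy assumptions. The function `toy_generate` is a hypothetical stand-in for a language model, the fine-tuning step is left out, and none of it is confirmed to match what OpenAI does.

```python
import random


def toy_generate(question: tuple[int, int], rng: random.Random) -> tuple[str, int]:
    """Hypothetical stand-in for a model: returns (rationale, answer).

    The task is always addition, but the toy picks the wrong operation
    about half the time on purpose.
    """
    a, b = question
    if rng.random() < 0.5:
        return (f"Add {a} and {b} to get {a + b}.", a + b)
    return (f"Multiply {a} and {b} to get {a * b}.", a * b)


def collect_good_rationales(problems, samples_per_problem=8, seed=0):
    """Keep only rationales whose final answer matches the known answer."""
    rng = random.Random(seed)
    kept = []
    for question, gold in problems:
        for _ in range(samples_per_problem):
            rationale, answer = toy_generate(question, rng)
            if answer == gold:
                kept.append((question, rationale))
                break  # one good rationale per problem is enough for this toy
    return kept  # STaR would fine-tune the model on examples like these


problems = [((3, 4), 7), ((5, 6), 11)]
for question, rationale in collect_good_rationales(problems):
    print(question, "->", rationale)
```

The filter never inspects the reasoning itself. It only checks the outcome, which is why the approach is usually described as learning by trial and error.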
The connection is not presented as a confirmed technical blueprint for Strawberry. Instead, it is described as a similarity. Quiet-STaR could be abbreviated to "Q*", which is notable because Strawberry was reportedly formerly known as Q*.
The source article also says experts believe Q*/Strawberry combines large language models with planning algorithms, similar to chess programs or poker AI. Reinforcement learning and compute time at inference, meaning when the model is being used rather than trained, are also described as likely to play a crucial role.
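One publicly known way to trade extra compute for better answers is to sample several answers and take a majority vote, often called self-consistency. The sketch below simulates that with a hypothetical `noisy_solver` that is right 60 percent of the time. It illustrates the general idea of spending more computation per question, not the mechanism Strawberry uses, which remains unknown.

```python
import random
from collections import Counter


def noisy_solver(rng: random.Random, p_correct: float = 0.6) -> str:
    """Hypothetical model call: returns the right answer '42' with probability p_correct."""
    if rng.random() < p_correct:
        return "42"
    return rng.choice(["41", "43", "24"])


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer (ties go to the one seen first)."""
    return Counter(answers).most_common(1)[0][0]


def success_rate(n_samples: int, trials: int = 2000, seed: int = 0) -> float:
    """Share of trials in which the majority answer over n_samples is correct."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        answers = [noisy_solver(rng) for _ in range(n_samples)]
        wins += majority_vote(answers) == "42"
    return wins / trials


# More samples per question means more compute spent when the model is used.
for n in (1, 5, 25):
    print(n, round(success_rate(n), 3))
```

In this simulation the success rate rises as the number of samples grows, because the wrong answers are spread across several values while the correct one accumulates votes.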
What remains uncertain
Strawberry has been a subject of speculation since fall 2023, when rumors of a potential OpenAI breakthrough began circulating. At the time, Q* was said to be capable of solving complex mathematical problems. OpenAI CEO Sam Altman indirectly confirmed Q*'s existence by calling it an "unfortunate leak."
Still, the state of the project is not clear from the source article. Reuters reporting describes goals, internal documents, sources, and possible links to research methods. It does not provide a public product, release plan, or full technical explanation.
The most grounded takeaway is narrower and more useful: OpenAI is reportedly working on AI systems that reason more deliberately, perform autonomous web research, and handle longer tasks. The MATH benchmark report, if tied to the same research direction, would fit that pattern, but the source explicitly says Reuters could not confirm whether the high-scoring AI was Strawberry.
For the future of AI, the stakes are straightforward. Better reasoning could change how models approach math, research, planning, and technical work. But the public evidence still leaves key questions open: how Strawberry works, how far it has progressed, and whether its reported abilities will hold up outside internal tests.