Why OpenAI’s IMO result raises the bar for AI reasoning

OpenAI says an experimental language model solved five of six IMO 2025 problems and reached gold medal level with 35 out of 42 points. The claim is significant because the model used natural language proofs under standard competition conditions, but the result has not yet been independently confirmed.

OpenAI says one of its experimental language models has reached gold medal level on the International Mathematical Olympiad, a result that would mark a major step for AI reasoning if it holds up under independent scrutiny.

The model solved the first five of the six official IMO 2025 problems and earned 35 out of a possible 42 points, according to OpenAI researchers Alexander Wei and Noam Brown. The result stands out because the IMO is described as the most difficult math competition for high school students and requires both creativity and strict logical reasoning.
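The score arithmetic can be checked with a short sketch. Each IMO problem is graded on a scale of 0 to 7, so six problems give the 42-point maximum. The per-problem breakdown below is an assumption consistent with the reported figures (full marks on the first five problems, none on the sixth), not a published score sheet, and `total_score` is a hypothetical helper name.

```python
POINTS_PER_PROBLEM = 7   # each IMO problem is graded on a 0-7 scale
NUM_PROBLEMS = 6
MAX_SCORE = POINTS_PER_PROBLEM * NUM_PROBLEMS  # 42


def total_score(per_problem):
    """Sum per-problem marks after checking they form a valid score sheet."""
    if len(per_problem) != NUM_PROBLEMS:
        raise ValueError("expected one mark per problem")
    if any(not 0 <= p <= POINTS_PER_PROBLEM for p in per_problem):
        raise ValueError("each mark must be between 0 and 7")
    return sum(per_problem)


# Assumed breakdown: full marks on problems 1-5, nothing on problem 6.
reported = [7, 7, 7, 7, 7, 0]
print(total_score(reported), "out of", MAX_SCORE)  # prints: 35 out of 42
```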

What OpenAI Says The Model Achieved

The central claim is straightforward: OpenAI’s experimental model completed IMO 2025 problems at a level associated with a gold medal. The company says the model worked under standard competition conditions, using two 4.5-hour sessions, no outside help, no tool use, and answers written in natural language.

Former IMO medalists graded the responses anonymously. The full solutions are available on GitHub, according to the source article.

That setup matters because it frames the result as more than a benchmark score. OpenAI is presenting the model as capable of producing rigorous mathematical arguments in ordinary language, rather than relying on a special-purpose math system or a custom evaluation framework.

Alexander Wei described the model as the first AI system able to “craft intricate, watertight arguments at the level of human mathematicians.” That is a strong claim, and the source article is clear that the results have not yet been independently confirmed.

Why This Is Different From Narrow Math Systems

OpenAI’s researchers position the model as a general-purpose reasoning language model. That distinction is important because some earlier AI math systems were built specifically for mathematical problem solving.

Unlike DeepMind’s AlphaGeometry, which is built specifically for math, OpenAI’s model is described as a broader reasoning system. Wei said the result came through “general-purpose reinforcement learning and test-time compute scaling,” not through a narrow, task-specific method.

Noam Brown also described the work as based on “new experimental general-purpose techniques.” He said the model scales its compute at test time, but did not provide the technical details.

Brown summarized the difference in thinking time by comparing the model with earlier OpenAI systems: “o1 thought for seconds. Deep Research for minutes. This one thinks for hours.” In plain terms, OpenAI is saying the model spends far more compute during problem solving, rather than simply answering quickly from a fixed pattern.

That makes the result relevant beyond contest math. If a general-purpose model can build long, coherent chains of reasoning under strict conditions, the same direction could eventually matter for scientific and technical work. The source article notes, however, that the real value depends on whether the result can be reproduced independently and applied to real scientific problems.

The Update From Jerry Tworek

An update dated July 20, 2025 added comments from OpenAI researcher Jerry Tworek. He confirmed on X that the model received “very little IMO-specific work” and said the result came from continued training of general-purpose base models.

Tworek also said all solutions used natural language proofs and no special evaluation framework. He called the achievement a genuine research breakthrough from Alexander Wei’s team.

The update adds another important detail: Tworek later said a public release of the model is possible by the end of the year. That contrasts with Wei’s earlier statement that the IMO model was a research project and that OpenAI had no plans to release it or a similar model in the coming months.

Tworek also linked the IMO result to other OpenAI announcements from the same week. He said three results came from the same reinforcement learning system: the general AI agent system, a close loss to a human in a heuristic programming contest, and the five solved IMO problems. According to Tworek, ChatGPT agent runs on an earlier version built on an older base model.

How Current Models Compare

The announcement arrived after current AI models performed poorly on the same competition tasks. A recent MathArena.ai evaluation tested several leading models, including Gemini 2.5 Pro, Grok-4, DeepSeek-R1, OpenAI’s o3, and OpenAI’s o4-mini.

None of those models reached the 19 points needed for a bronze medal. Gemini 2.5 Pro led the group with 13 out of 42 points, while the others scored lower.

The evaluation was not a casual test. It included a best-of-32 selection process and assessment by IMO experts. Even with that setup, the models produced logical errors, incomplete arguments, and made-up theorems.
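For readers unfamiliar with the term, best-of-n selection means sampling several candidate answers and keeping only the one a scorer rates highest. The sketch below illustrates the general idea with n = 32. The article does not describe how MathArena ranked its candidates, so `generate` and `score` are hypothetical stand-ins, and the toy functions exist only to make the example runnable.

```python
from typing import Callable


def best_of_n(generate: Callable[[int], str],
              score: Callable[[str], float],
              n: int = 32) -> str:
    """Draw n candidate answers and return the one the scorer rates highest."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


# Toy stand-ins (assumptions for illustration only): attempt i is a string
# that grows with i, and the scorer simply prefers longer answers.
def toy_generate(i: int) -> str:
    return "step " * (i + 1)


def toy_score(answer: str) -> float:
    return float(len(answer))


chosen = best_of_n(toy_generate, toy_score)
print(len(chosen.split()))  # number of "step" tokens in the selected answer
```

Selection only picks the most promising attempt; it does not make that attempt correct, which is why the chosen answers in the evaluation still went to IMO experts for grading.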

That context makes OpenAI’s claim more striking. If the experimental model truly earned 35 out of 42 points under standard competition conditions, it would represent a sharp gap between frontier research systems and the current models available for comparison.

What Still Needs To Be Proven

The biggest limitation is confirmation. The source article states that OpenAI’s results have not yet been independently verified. Until that happens, the claim remains an important research announcement rather than a settled fact.

There are also open questions about which approaches other teams used this year. The article notes rumors that DeepMind has also earned a gold medal in the IMO contest, but says the company has not made any official announcement.

Last year, DeepMind’s AlphaProof and AlphaGeometry systems reached silver by solving four out of six problems. Those systems used a hybrid method combining a pre-trained LLM with elements from classic search algorithms.

For OpenAI, the key question is whether a standard language model can keep scaling this kind of reasoning. Brown argued that even a small advantage over human performance can drive major scientific progress, and said the result surprised people inside OpenAI. He called it “a milestone that many considered years away.”

For now, the claim is best read carefully: OpenAI says its experimental model reached gold medal level on IMO 2025, using natural language proofs and no tools, while current public models remain far behind. The next test is whether the result can be independently reproduced and whether the same reasoning ability transfers beyond competition math.