Ars Technica AI July 21, 2025

How Gemini Deep Think Reached IMO Gold by Showing Its Work

Google DeepMind says Gemini Deep Think solved five of six International Mathematical Olympiad problems under the same rules used for human competitors. The score was enough for gold medal status and highlighted the importance of long-form reasoning, not just final answers.

The story highlights a meaningful gain in advanced AI reasoning capability, but without direct autonomy, harm, or societal degradation concerns.

Google DeepMind says its Gemini Deep Think model reached gold medal status at the International Mathematical Olympiad (IMO) by solving five of the six problems under the same competition rules followed by human participants. The result matters because the IMO does not reward answers alone; scores also depend on a rigorous proof and a clear path from problem to conclusion.

The company’s claim stands out not only for the score, but for the process. DeepMind worked with the IMO to have the model’s work officially graded and certified by the coordinators, a point the company emphasized when comparing its approach with OpenAI’s announced results.

A harder test for AI reasoning

The International Mathematical Olympiad is designed for pre-university mathematicians, but its problems are far from routine. Competitors need to connect ideas from algebra, combinatorics, geometry, and number theory, often across several steps. That makes the contest attractive to AI researchers who want a demanding test of reasoning rather than pattern matching.

DeepMind entered the competition with a model built around a different approach from last year. In 2024, its IMO system combined AlphaProof and AlphaGeometry 2. That setup solved four of the six questions and reached silver medal status. The result was already notable, especially because only half of the human participants earn any medal at all.

In 2025, the company used Gemini Deep Think. Google had announced the model earlier this year as a more analytical version of simulated reasoning. Instead of following a single reasoning path, Deep Think runs multiple reasoning processes in parallel, compares them, and integrates the results before producing a final answer.
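
DeepMind has not published how Deep Think runs, compares, and integrates its parallel reasoning processes, so the following Python sketch is only a toy illustration of the general idea. The three `path_*` functions and `solve_in_parallel` are hypothetical names, and simple majority voting is a stand-in for whatever integration step the real model uses.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def path_trial_division(n):
    """Hypothetical reasoning path: count divisors of n by trying every d."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)


def path_divisor_pairs(n):
    """Hypothetical reasoning path: count divisors in pairs (d, n // d)."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            # A perfect-square root pairs with itself, so count it once.
            count += 1 if d * d == n else 2
        d += 1
    return count


def path_flawed(n):
    """Hypothetical flawed path: forgets that n divides itself."""
    return sum(1 for d in range(1, n) if n % d == 0)


def solve_in_parallel(problem, paths):
    """Run every path concurrently, then keep the most common answer.

    Majority voting is an assumption made for this sketch, not a
    description of how Deep Think integrates its results.
    """
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        # pool.map returns results in the same order as `paths`.
        candidates = list(pool.map(lambda path: path(problem), paths))
    answer, votes = Counter(candidates).most_common(1)[0]
    return answer, votes, candidates


answer, votes, candidates = solve_in_parallel(
    12, [path_trial_division, path_divisor_pairs, path_flawed]
)
print(answer, votes, candidates)
```

In this toy run, two of the three paths agree that 12 has six divisors, so the flawed third path is outvoted.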

Why natural language changed the setup

According to Thang Luong, DeepMind senior scientist and head of the IMO team, the new system represents a major change from the earlier effort. In 2024, an expert had to translate the competition’s natural language problems into a domain-specific language. After the model produced output, an expert also had to interpret it.

Gemini Deep Think did not use that workflow. The model handled the problems in natural language from start to finish and was not specifically designed to do math. That mattered because it allowed the system to take in the same problem descriptions as the students and respond within the competition’s 4.5-hour time limit.

Deep Think also runs more slowly than the simpler Gemini versions available in the Gemini app. In this setting, speed was not the only goal. The model needed to produce mathematical work that could be judged under IMO expectations, where showing the reasoning is part of the evaluation.

Training for complete proofs

Luong explained that earlier attempts to improve LLM math ability often relied on reinforcement learning around final answers. That can help a model land on a correct result, but it can also leave what Luong called “incomplete reasoning,” which falls short of a complete proof. For the IMO, that gap is important.

To prepare Deep Think for the contest, Google used new reinforcement learning techniques with higher-quality “long answer” solutions to mathematical problems. The aim was to strengthen the model’s ability to support each step, not merely to finish with the right number or statement.

“With this kind of training, you can actually get robust, long-form reasoning,” Luong said.
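
The difference between rewarding a final answer and rewarding a full argument can be sketched with two toy scoring functions. This is an illustration only: the article does not describe Google's reward design, and `Step`, `Solution`, `final_answer_reward`, and `long_form_reward` are hypothetical names, as is the `justified` flag that stands in for a grader's verdict on each step.

```python
from dataclasses import dataclass


@dataclass
class Step:
    claim: str
    justified: bool  # hypothetical: did a grader accept this step?


@dataclass
class Solution:
    steps: list
    final_answer: int


def final_answer_reward(solution, expected):
    """Reward only the last line; the reasoning is never inspected."""
    return 1.0 if solution.final_answer == expected else 0.0


def long_form_reward(solution, expected):
    """Scale the reward by the share of steps that are justified.

    This weighting is an assumption chosen for the sketch, not a
    documented training objective.
    """
    if not solution.steps:
        return 0.0
    justified_share = sum(s.justified for s in solution.steps) / len(solution.steps)
    return justified_share * final_answer_reward(solution, expected)


lucky_guess = Solution(
    steps=[Step("the answer is probably 6", justified=False)],
    final_answer=6,
)
full_proof = Solution(
    steps=[
        Step("12 = 2^2 * 3", justified=True),
        Step("number of divisors = (2 + 1) * (1 + 1)", justified=True),
    ],
    final_answer=6,
)

for name, sol in [("lucky guess", lucky_guess), ("full proof", full_proof)]:
    print(name, final_answer_reward(sol, 6), long_form_reward(sol, 6))
```

Under the answer-only reward, the unsupported guess and the complete argument score the same. Under the step-aware reward, only the complete argument earns credit, which is the gap the IMO's proof-based grading exposes.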

That long-form reasoning was visible in the way DeepMind described the model’s performance on the third problem. Many human competitors used Dirichlet’s Theorem, a graduate-level concept outside the intended scope of the competition. Deep Think instead found a path using simpler mathematics.

“Our model actually made a brilliant observation and used only elementary number theory to create a self-contained proof of the given problem,” said DeepMind researcher and Brown University professor Junehyuk Jung.

The point is not only that the model reached an answer, but that it produced a proof within the mathematical level expected by the competition.

The missed problem still reveals the limit

Gemini Deep Think did not solve everything. The problem it missed asked for the minimum number of rectangles needed to cover a given space. The team described it as objectively the hardest problem in the competition.

Jung said the model began from a false hypothesis: it assumed the answer would be greater than or equal to 10. Once that starting point was wrong, the rest of the attempt could not recover.

“There’s no way it’s going to solve it because that is not true to begin with,” Jung said.

Only five students solved that problem. Even with the miss, Google received 35 points out of a possible 42, enough for gold medal status. Only about 8 percent of human participants reach that level.

Official grading became part of the story

DeepMind repeatedly stressed that Gemini Deep Think went through the same evaluation as student competitors. That distinction became important because OpenAI also announced IMO results, but did not work with the organization to follow the established process. Instead, OpenAI used a panel of former IMO participants to grade its answers and awarded itself a gold medal.

“We confirmed with the IMO organization that we actually solved five perfectly,” Luong said. “I think anyone who didn’t go through that process, we don’t know, they might have lost one point and gotten silver.”
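
The one-point margin in that remark follows from simple arithmetic. The sketch below assumes the standard IMO marking of 7 points per problem (consistent with five perfect solutions giving 35) and treats 35 as the gold cutoff, as the article's figures imply. The constant and function names are illustrative.

```python
POINTS_PER_PROBLEM = 7  # assumed standard IMO marking: 0 to 7 per problem
NUM_PROBLEMS = 6
GOLD_CUTOFF = 35        # assumption inferred from the article's figures


def total_score(points):
    """Sum per-problem marks after checking they form a valid scoresheet."""
    if len(points) != NUM_PROBLEMS:
        raise ValueError("expected one mark per problem")
    if any(not 0 <= p <= POINTS_PER_PROBLEM for p in points):
        raise ValueError("each mark must be between 0 and 7")
    return sum(points)


def is_gold(score):
    return score >= GOLD_CUTOFF


# Five perfect solutions and one miss; which slot holds the miss is illustrative.
certified = [7, 7, 7, 7, 7, 0]
# The same sheet with a single point deducted on one proof.
one_point_lost = [7, 7, 6, 7, 7, 0]

print(total_score(certified), is_gold(total_score(certified)))
print(total_score(one_point_lost), is_gold(total_score(one_point_lost)))
```

With exactly 35 points there is no slack: a single deduction on any of the five solved problems would drop the total to 34 and below the assumed gold line, which is why officially certified marks matter for the claim.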

The version of Deep Think tuned for the IMO is not being retired after the competition. Google says it is being rolled out to trusted testers, including mathematicians. Eventually, it will be offered to Google AI Ultra subscribers, who pay $250 per month for access to Google’s largest and most expensive models.

DeepMind also plans to keep improving the system and return next year in pursuit of a perfect score. For now, the result shows how much AI math performance depends on the quality of the reasoning trail, the grading process, and the ability to work under the same constraints as human competitors.