GPT-5.2 Pro has delivered a notable result on one of the toughest public tests for AI math ability. In testing reported by Epoch AI, the model reached 31 percent on FrontierMath's hardest tier, Tier 4, setting a new record and moving past Gemini 3 Pro's previous best of 19 percent.
The result matters for more than the headline score. GPT-5.2 Pro solved 15 of 48 tasks, including four problems that no model had solved before. That combination makes the benchmark result a meaningful signal for anyone tracking how quickly AI systems are improving at advanced mathematical reasoning.
What Epoch AI Tested
The benchmark at the center of the result is FrontierMath, described in the source as notoriously difficult. GPT-5.2 Pro's 31 percent came on Tier 4, the benchmark's hardest tier.
That score matters most when compared with the previous leading result named in the source: Gemini 3 Pro at 19 percent. The gap between 31 percent and 19 percent is the clearest numerical sign that GPT-5.2 Pro has moved the benchmark forward.
Epoch AI ran the tests manually through the ChatGPT website because of API issues. That detail is worth keeping in view because it describes the conditions under which the score was obtained: the source does not describe an automated API-based run for this result.
The Numbers Behind the Result
The headline score is only one part of the picture. GPT-5.2 Pro solved 15 of 48 tasks. Within that set were four problems that no model had solved before.
Those four newly solved tasks are especially notable. A benchmark can show progress in two ways: by improving a percentage score and by breaking through on items that had remained out of reach. GPT-5.2 Pro did both in this reported test.
The core facts are straightforward:
- GPT-5.2 Pro reached 31 percent on FrontierMath Tier 4.
- Gemini 3 Pro's previous best was 19 percent.
- GPT-5.2 Pro solved 15 of 48 tasks.
- Four of those tasks had not been solved by any model before.
- Epoch AI ran the tests manually through the ChatGPT website because of API issues.
Taken together, these points make the result more than a small benchmark fluctuation. The score, the comparison with Gemini 3 Pro, and the previously unsolved problems all point in the same direction: GPT-5.2 Pro performed at a level not previously reported for an AI model on this hardest tier.
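The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the numbers reported above; the variable names are illustrative, and the 19 percent figure for Gemini 3 Pro is taken as given because the source does not state how many tasks it solved.

```python
# Figures as reported in the article; variable names are illustrative.
solved = 15             # tasks GPT-5.2 Pro solved on Tier 4
total = 48              # tasks in the Tier 4 set
newly_solved = 4        # tasks no model had solved before
previous_best_pct = 19  # Gemini 3 Pro, reported only as a percentage

# Exact score before rounding to the reported headline figure.
score_pct = 100 * solved / total
reported_pct = round(score_pct)

# Gap to the previous best, in percentage points.
gap_points = reported_pct - previous_best_pct

# Share of GPT-5.2 Pro's solved tasks that were first-time solves.
newly_solved_share_pct = 100 * newly_solved / solved

print(f"Exact score: {score_pct:.2f}%")
print(f"Reported score: {reported_pct}%")
print(f"Gap over previous best: {gap_points} percentage points")
print(f"First-time solves among solved tasks: {newly_solved_share_pct:.1f}%")
```

The exact score is 31.25 percent, which rounds to the reported 31 percent, and the lead over the previous best is 12 percentage points.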
Why Mathematicians' Reactions Matter
The source says several mathematicians reviewed the solutions and gave them mostly positive assessments. That is an important part of the story because math benchmarks depend not only on whether the final answer is correct, but also on whether the reasoning behind it is convincing.
At the same time, the reaction was not unqualified. Some reviewers criticized the lack of precision in certain explanations. That caveat limits how far the result should be taken.
In practical terms, the benchmark result suggests stronger capability, but the comments about precision show that quality still has to be judged carefully. A system can solve difficult tasks while still producing explanations that experts may find incomplete or insufficiently exact in places.
This distinction is central to understanding AI progress in mathematics. The source supports a clear conclusion that GPT-5.2 Pro set a record on FrontierMath Tier 4. It does not support the broader conclusion that the model's mathematical explanations are uniformly precise.
How This Fits With Recent AI Math Reports
The FrontierMath result also lines up with recent reports mentioned in the source about AI models becoming genuinely useful for mathematical work. Those reports particularly involve GPT-5-Thinking and GPT-5-Pro.
The source says GPT-5 has reportedly solved Erdős problems on its own and helped researchers work through others. That context makes the GPT-5.2 Pro result part of a larger pattern: advanced models are not only answering benchmark questions, but are also being discussed in connection with real mathematical problem solving.
Still, the source includes a clear warning from renowned mathematician Terence Tao, who cautions against drawing premature conclusions. That caution is essential. A record benchmark result is evidence of progress, but it is not the same thing as a complete verdict on how AI systems will function across mathematics as a discipline.
For now, the most grounded reading is also the most useful one. GPT-5.2 Pro has set a new reported high mark on the hardest tier of FrontierMath, outperforming Gemini 3 Pro's previous score and solving problems that had resisted earlier models. The result strengthens the case that frontier AI systems are becoming more capable at difficult math, while the expert feedback shows that rigor and precision remain the standard by which those systems must be judged.