Math is becoming AI’s hardest AGI checkpoint

OpenAI researchers Sebastian Bubeck and Ernest Ryu argue that math has become a central test for progress toward artificial general intelligence. The reason is not just difficulty: math exposes whether AI models can reason for long periods, catch errors and produce work that experts can verify.

AI progress in mathematics has moved quickly. According to OpenAI researchers Sebastian Bubeck and Ernest Ryu, models have gone from grade-school arithmetic to olympiad-level and research mathematics in only two years.

That shift matters because math is not just another benchmark. In their view, it is a demanding test of whether AI systems can sustain reasoning, handle mistakes and contribute to difficult work without losing the thread.

Why math has become the test case

Bubeck argues that mathematics is useful for measuring progress toward artificial general intelligence because it requires long chains of precise reasoning. A proof can fail because of one wrong step, even if every other part looks strong.

That makes math unusually strict. A model cannot simply sound persuasive. It has to keep the argument coherent, notice when something breaks and repair the reasoning.

The field also gives researchers a practical advantage: problems can be stated clearly, answers can be checked and there is little room for debate about whether a result is correct. For AI evaluation, that clarity is valuable.

Bubeck describes the shift in terms of "AGI time," meaning how long a model can sustain the kind of reasoning a person would bring to a problem. Two years ago, models could imitate a student’s reasoning for minutes. Today, he says, they can operate across days or even a week. The next target is weeks and months.

From simple geometry to research help

The change has been sharp. Four years ago, Bubeck was impressed that Google's Minerva model could draw a line through points on a coordinate system. Now, he told Andrew Mayne, AI systems are helping Fields Medal winners with their daily work.

That does not mean the models are replacing mathematicians. The examples in the source point to a more collaborative pattern: the AI proposes ideas, searches through possibilities and generates steps, while an expert checks the work and directs the effort.

Ryu’s own example shows that dynamic clearly. Ryu, a former UCLA math professor, says he used ChatGPT to solve a 42-year-old open problem about Nesterov's method in optimization theory. The work took 12 hours spread across three evenings, after he had already spent more than 40 hours on the problem without AI and made no progress.

In that process, Ryu did not treat the system as an authority. He acted as the verifier. He caught errors, rejected weak paths and pushed the conversation toward more promising directions.

What OpenAI wants to transfer beyond math

Bubeck’s broader point is that training progress in math is not supposed to stay confined to math. He says OpenAI's training methods are general, not specific to mathematics. If that is right, similar gains should appear in other scientific areas.

The source names biology and materials science as fields where this kind of reasoning could matter. The link is not that those fields are identical to mathematics. It is that they also require structured thinking, error checking and progress across difficult problems.

Bubeck compares the role of math in AI training to the role of math in human education. Students do not learn it only because they will later write proofs. They learn it because it forces logical thinking.

OpenAI researchers are also building what the source calls an "automated researcher." The goal is a system that can work on problems independently over long stretches of time. In that context, math becomes a proving ground for persistence and reliability.

The Erdős problems show both promise and confusion

Bubeck and Ryu also discuss the Erdős problems, a collection of open questions left by the late Hungarian mathematician Paul Erdős. Bubeck says internal models initially found solutions to ten problems marked as open, mostly through deep literature searches rather than new proofs.

The way that result was communicated caused confusion. Bubeck says a tweet of his was misleading and led many people to think OpenAI had produced new proofs. The misunderstanding sparked a public spat with Google DeepMind CEO Demis Hassabis.

By now, Bubeck says, ChatGPT and internal models have produced more than ten genuinely new solutions worthy of publication in academic journals. He sees that as evidence that models are beginning to move beyond recombining existing knowledge and into producing new mathematics.

The philosophical question remains unresolved. The source notes that it is still open whether scientific progress is more than clever recombination plus some reasoning. But for practical purposes, the pace of progress has changed the debate.

Why expertise still matters

Both researchers warn against shallow use of AI for mathematics. Their position is not that anyone can now generate reliable proofs by prompting a model. It is almost the opposite: expertise becomes more important because someone has to judge the output.

They argue that trained mathematicians are the people most able to make the tools productive. Long AI-generated proofs posted on social media by non-mathematicians are usually wrong, according to the source.

Ryu sees a related problem in programming, where he says a whole generation is losing the ability to use debuggers. The concern is mental atrophy: if users accept outputs too easily, they may lose the habits needed to find mistakes.

Bubeck also warns that claims about scientists no longer being needed are dangerous. Academic institutions, he says, need to actively reclaim their role.

At the same time, the researchers see real upside. AI could speed up proof verification, which currently takes years, and help flag problems in published papers. The future described here is not one where expertise disappears, but one where expert review may become faster and more powerful.