MIT Tech Review, AI, July 30, 2024

Why solving Olympiad math may move AI reasoning forward

Google DeepMind built AlphaProof and AlphaGeometry 2, AI systems that together solved four out of six problems from this year's International Mathematical Olympiad. The result matters because math offers clear tests of reasoning, correctness, planning and generalization.

The story signals meaningful progress in AI reasoning and problem-solving capability, but in a controlled research context without direct harm or autonomy.

Google DeepMind has shown a notable step forward for AI reasoning with two systems built to solve difficult math problems. AlphaProof and AlphaGeometry 2 worked together on this year's International Mathematical Olympiad and solved four out of six problems, a performance equivalent to a silver medal.

The achievement is not only about mathematics. It points to a broader question in artificial intelligence: can systems move beyond producing convincing language and begin handling tasks where the answer can be checked with precision?

Why the result stands out

AI news has recently been crowded with major product announcements. Meta updated its powerful new Llama model and is offering it for free, while OpenAI said it would trial SearchGPT, an AI-powered online search tool that users can chat with.

Those developments matter, but the Google DeepMind result is different in kind. Chat-based search can make a system feel more intelligent. A math-solving system has to demonstrate reasoning under stricter conditions.

The International Mathematical Olympiad is a prestigious competition for high school students. Its problems are designed to be difficult, abstract and resistant to quick pattern matching. To perform well, a system must plan, handle symbolic ideas and produce work that can be evaluated as correct or incorrect.

That is why solving four out of six problems is such an important signal. According to the source article, this is the first time any AI system has reached such a high success rate on these kinds of problems.

Math as a test for AI reasoning

Math is useful for AI research because it gives researchers something language tasks often do not: a clear way to judge whether a result is right. A model can write fluent text that sounds persuasive while still being wrong. In mathematics, a proof has to hold up.

For AlphaProof and AlphaGeometry 2, the challenge involved more than retrieving known facts. The systems needed to work across different branches of mathematics and generalize across a range of problems. That ability to transfer methods from one problem type to another is one reason the work is being treated as a milestone.

David Silver, principal research scientist at Google DeepMind, described the approach as combining reinforcement learning, which was central to systems such as AlphaGo, with large language models. In this case, that combination was used to construct programs in Lean, a programming language and proof assistant in which mathematical proofs are written so that a computer can check them.
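To make the idea of a proof written as a Lean program concrete, here is a minimal illustration. It is not taken from AlphaProof's output and is far simpler than any Olympiad problem. It assumes Lean 4 with only the core library, and the theorem name `add_comm_example` is made up for this sketch.

```lean
-- A statement and its proof. Lean accepts the file only if the proof
-- term really establishes the statement.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A concrete fact that Lean checks by direct computation.
example : 2 + 2 = 4 := rfl
```

If either proof were wrong, Lean would reject the file instead of returning a plausible-looking answer. That pass-or-fail property is what makes the output checkable.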

The key idea is that reinforcement learning can be especially powerful when a system receives clear feedback. In math, correctness can be judged with far less ambiguity than in many open-ended tasks. Silver said the same recipe could apply in any setting where the reward signal is clear and verified, meaning there is a reliable way to measure whether an answer is correct.
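The role of a verified reward can be sketched with a toy example. This is not DeepMind's training setup. It is a minimal epsilon-greedy bandit, and every name in it (`verifier`, `careful`, `sloppy`, `train`) is hypothetical. An exact checker stands in for a proof checker, and the learner is rewarded only when its answer passes the check.

```python
import random


def verifier(a: int, b: int, answer: int) -> bool:
    """Exact check, standing in for a proof checker: True only if correct."""
    return answer == a + b


def careful(a: int, b: int, rng: random.Random) -> int:
    """Hypothetical strategy that always produces the right sum."""
    return a + b


def sloppy(a: int, b: int, rng: random.Random) -> int:
    """Hypothetical strategy that is off by one about half the time."""
    return a + b + rng.choice([0, 1])


def train(steps: int = 500, epsilon: float = 0.1, seed: int = 0):
    """Epsilon-greedy choice between the two strategies, using verified reward."""
    rng = random.Random(seed)
    strategies = [sloppy, careful]
    value = [0.0, 0.0]  # running average reward per strategy
    count = [0, 0]      # how often each strategy was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            i = rng.randrange(len(strategies))  # explore
        else:
            i = max(range(len(strategies)), key=lambda k: value[k])  # exploit
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        # Reward is 1 only when the verifier accepts the answer.
        reward = 1.0 if verifier(a, b, strategies[i](a, b, rng)) else 0.0
        count[i] += 1
        value[i] += (reward - value[i]) / count[i]  # incremental mean
    return value, count


if __name__ == "__main__":
    value, count = train()
    print("average reward (sloppy, careful):", value)
    print("times chosen   (sloppy, careful):", count)
```

Because the reward comes from an exact check rather than from how convincing an answer sounds, the learner has no way to score well with the unreliable strategy. Over enough steps it shifts toward the one that is always right.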

Where this could matter beyond math

The source article points to coding as one possible application. That makes sense within the facts described: software can often be tested, checked and verified in ways that resemble mathematical correctness more closely than ordinary conversation does.
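A rough sketch of what "checked" can mean for code: candidate functions are accepted only if they pass every test case. The candidates, the test cases and the helper names (`passes_all`, `first_verified`) are all invented for illustration.

```python
from typing import Callable, Iterable, Optional, Tuple

# One test case: (arguments to pass, expected result).
TestCase = Tuple[tuple, object]


def passes_all(candidate: Callable, cases: Iterable[TestCase]) -> bool:
    """Return True only if the candidate gives the expected result on every case."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed check.
            return False
    return True


def first_verified(candidates: Iterable[Callable],
                   cases: Iterable[TestCase]) -> Optional[Callable]:
    """Return the first candidate that passes every case, or None."""
    cases = list(cases)
    for candidate in candidates:
        if passes_all(candidate, cases):
            return candidate
    return None


# Hypothetical candidate implementations of "absolute value".
def buggy_abs(x):
    return x


def crashing_abs(x):
    return x / 0


def good_abs(x):
    return -x if x < 0 else x


CASES = [((3,), 3), ((-3,), 3), ((0,), 0)]

if __name__ == "__main__":
    chosen = first_verified([buggy_abs, crashing_abs, good_abs], CASES)
    print(chosen.__name__ if chosen else "no candidate passed")
```

Passing a finite set of tests is weaker than a proof, since a function can pass every listed case and still be wrong on other inputs. This is why the analogy with mathematical correctness is close but not exact.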

Silver also discussed wider possibilities for systems that can prove or verify things. He said, “We are aiming to provide a system that can prove anything.” The source article frames the long-term idea as an AI system that could become as reliable as a calculator for certain proof-based or verification-heavy tasks.

Potential uses described in the source include:

providing proofs for challenging problems
verifying tests for computer software
checking scientific experiments
supporting AI tutors that give feedback on exam results
fact-checking news articles

These are not claims that the current systems can already do all of that. They are examples of where stronger proof and verification tools could eventually be useful if the underlying reasoning becomes more capable and dependable.

The reality check

The result is impressive, but the limits are important. AlphaProof and AlphaGeometry 2 solved hard high-school-level problems. That is still far from the extremely hard problems top human mathematicians can solve.

Google DeepMind also stressed that the systems have not, at this point, added to the body of mathematical knowledge created by humans. In other words, the breakthrough is about performance on a demanding benchmark, not about discovering new mathematics.

That distinction matters. The systems demonstrated that AI can make progress on a class of problems where planning, abstraction and verification are central. They did not show that AI has matched expert mathematicians at the frontier of the field.

Still, the direction is significant. If AI systems can become stronger in domains where correctness can be checked, they may become more useful in technical work that depends on reliability rather than just fluency.

A broader reason for optimism

Katie Collins, a researcher at the University of Cambridge who specializes in math and AI and was not involved in the project, pointed to another possible effect. She said these tools can create and evaluate new problems, encourage new people to enter the field and spark more wonder.

That may be one of the most practical implications. Better math-focused AI does not only promise automation. It could also support learning, exploration and the creation of new challenges for people who want to understand mathematics more deeply.

The larger lesson is simple: progress in AI is not only about chatbots, search interfaces or bigger language models. Sometimes the more important signal comes from systems that can handle a problem with a clear answer, a strict standard and no room for bluffing.