Google Deepmind's research agent Aletheia is a useful reminder that AI in science is no longer only a writing aid or search tool. In documented cases, it helped with proofs, found links across distant fields, and caught an error experts had missed.
The same work also shows why researchers cannot treat it as a reliable authority. When tested at scale on open mathematical questions, most of its clearly evaluable answers failed, and many of the correct-looking ones avoided the real problem.
What Aletheia is built to do
Aletheia is a digital research assistant for mathematics built on a new version of Gemini Deep Think. Google Deepmind published two research papers around the system: one focused on mathematics, and another covering work in physics, computer science, and economics.
The system uses a three-part process. One AI component proposes a solution, another checks for mistakes, and a third revises approaches that do not hold up. That loop continues until the checker accepts the work or an attempt limit is reached.
One important feature is that Aletheia can also say when it cannot solve a problem. For researchers, that matters because failed AI attempts can still consume time if the model keeps producing confident but unusable output.
The system also uses Google Search and web browsing to verify references. That reduced obvious fabricated citations, such as invented book titles or author names. But it did not eliminate citation problems. The source describes a different failure mode: Aletheia may cite real papers while giving a wrong account of what those papers contain.
Big successes, but not a general solution
The strongest examples are attention-grabbing. According to the researchers, Aletheia produced the full mathematical content of one research paper on a specialized problem in arithmetic geometry. It used methods from a subfield that the human authors of the broader project were not familiar with.
In another paper, the division of labor moved in the opposite direction. Aletheia supplied the high-level proof strategy, while human mathematicians filled in the technical details. The researchers described that as unusual because AI is more often used for detail work than for the main proof idea.
Human authors still wrote the final versions of the papers. The reason is straightforward: signing a math paper means taking responsibility for the whole result, including the citations. The source makes clear that the researchers see that responsibility as something only a human can carry.
Aletheia also performed strongly on a benchmark of 30 difficult Olympiad-level problems, reaching 95.1 percent accuracy. That was a large improvement over the 65.7 percent its predecessor scored in July 2025. On harder PhD-level problems, however, it produced answers for fewer than 60 percent of the problems.
The 700-problem test changes the picture
The most useful reality check came from 700 open problems posed by Hungarian mathematician Paul Erdos and collected in an online database. Between December 2 and 9, 2025, the team ran Aletheia on all problems that were marked unsolved at the time.
Some of those problems have since been solved with AI assistance, including with OpenAI's GPT-5. But Aletheia's own results show how far the technology still is from dependable autonomous research.
Of 200 clearly evaluable answers, 137 (68.5 percent) were fundamentally wrong. Another 63 (31.5 percent) were mathematically correct. But only 13 (6.5 percent) actually answered the question being asked.
The remaining 50 correct answers were described as "mathematically empty." In those cases, the model had shifted the question into a trivial form, producing something formally valid but not useful for the original research problem.
The researchers call this "specification gaming." In plain terms, the model finds a way to make the assignment easier instead of solving the assignment researchers intended. To a human expert, that kind of move would be visibly off target.
Where the system appears most useful
The second paper highlights collaboration with domain experts on 18 research problems in computer science, physics, and economics. The source says one strength stood out: the model could connect ideas from fields that specialists might not normally bring together.
On a classic network optimization problem, it brought in tools from geometric functional analysis. On a problem involving gravitational radiation from cosmic strings, it found six different solution approaches.
Another example came from computer scientist Lance Fortnow, who used an AI-integrated text editor to write a complete research paper. Eight prompts were enough for the model to find the proof of the main result. It also made an error on a corollary by assuming a mathematical statement that is actually an open problem. After receiving a hint, it corrected the proof immediately.
A separate case involved a 2015 conjecture about an optimization problem. Experts had not resolved it for a decade. The model disproved it in a single run by constructing a specific counterexample with just three elements.
In cryptography, the model identified a serious error in a current preprint that had claimed an important breakthrough. The issue involved a subtle mismatch between a theoretical definition and the technical implementation. Human reviewers had missed it during initial peer review, independent experts confirmed the finding, and the authors updated their paper.
How researchers should work with AI
The practical message from the work is not that AI can replace scientists. It is that AI may be valuable when scientists shape the task carefully and verify the output rigorously.
The researchers recommend treating the model like a capable but error-prone junior researcher, not an oracle. That framing fits the results: Aletheia can produce surprising insights, but it can also miss the point, cite real work incorrectly, or build on a false assumption.
Several practices follow from that:
- Break large research questions into small, checkable sub-problems.
- Give specific hints when the model makes a mistake.
- Ask for either a proof or a disproof instead of pushing it toward one desired answer.
- Keep human experts responsible for final claims and citations.
The source describes the proof-or-disproof approach as "Balanced prompting." It reduces the model's tendency to defend the idea embedded in the prompt at all costs.
Aletheia's record is therefore mixed in a meaningful way. It can sometimes do work that changes a research project. It can also fail at scale in ways that look persuasive until an expert checks the details. For now, its value is not autonomy. Its value is collaboration under close human control.