The Decoder July 7, 2025 TERMINATOR

How Sakana AI makes large language models solve problems together

Sakana AI has built AB-MCTS, a method that lets several large language models work on the same problem together. Early ARC-AGI-2 tests suggest the multi-model approach can solve cases that single models miss, though choosing the final answer remains a major challenge.

WTF Index TERMINATOR

The story describes a capability improvement in coordinated AI problem-solving, but with no clear autonomy or harm angle.

Sakana AI is pushing a different idea of progress in artificial intelligence: instead of asking one large language model to solve a hard task alone, make several models collaborate, compare paths, and improve each other’s work.

The Japanese AI startup’s new method, AB-MCTS, is designed to coordinate models such as ChatGPT, Gemini, and DeepSeek as they search for answers. Early tests on ARC-AGI-2 suggest that this kind of model teamwork can outperform single-model attempts, especially when a problem benefits from multiple ways of thinking.

What AB-MCTS Does

AB-MCTS stands for Adaptive Branching Monte Carlo Tree Search. In plain terms, it is an algorithm for exploring possible solutions while deciding when to keep improving one idea and when to branch out into a different one.

That balance matters because complex problems often fail in two opposite ways. A system can spend too much effort polishing an answer that is fundamentally wrong. It can also jump between too many ideas without developing any of them far enough to become useful.

AB-MCTS tries to manage that tradeoff. It combines depth search, which refines an existing solution, with breadth search, which explores new approaches. A probability model then decides which direction should be pursued next.
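One common way to build this kind of probability model is Thompson sampling, and the sketch below assumes that choice. It is an illustration of the depth-versus-breadth decision, not Sakana AI's implementation. The option names, the `NEW_BRANCH` label, and the success rates are all hypothetical and exist only to make the example runnable.

```python
import random

NEW_BRANCH = "<new>"  # hypothetical label for "start a fresh approach" (breadth)

def pick_next(stats, rng):
    """Choose what to work on next by Thompson sampling.

    stats maps an option name to [wins, losses]. Options are existing
    candidates (refining one means going deeper) plus NEW_BRANCH
    (going wider). Each option gets one draw from its Beta posterior,
    and the highest draw wins.
    """
    draws = {
        name: rng.betavariate(1 + wins, 1 + losses)
        for name, (wins, losses) in stats.items()
    }
    return max(draws, key=draws.get)

def simulate(steps=2000, seed=0):
    rng = random.Random(seed)
    # Assumed success rates, invented for this demonstration only.
    true_rate = {"candidate_a": 0.2, "candidate_b": 0.6, NEW_BRANCH: 0.3}
    stats = {name: [0, 0] for name in true_rate}
    picks = {name: 0 for name in true_rate}
    for _ in range(steps):
        choice = pick_next(stats, rng)
        picks[choice] += 1
        # Record whether this attempt paid off and update the posterior.
        if rng.random() < true_rate[choice]:
            stats[choice][0] += 1
        else:
            stats[choice][1] += 1
    return picks

if __name__ == "__main__":
    print(simulate())
```

In this toy setup the sampler tries every option early on, then spends most of its budget on the option that keeps paying off, which is the behavior the article describes: refine what works, but keep some chance of branching out.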

The key change in the multi-model version is that the search is not limited to one AI system. Multi-LLM AB-MCTS can select from models such as ChatGPT, Gemini, or DeepSeek depending on which one appears best suited to the current part of the task.

Why Multiple Models Can Help

Large language models do not always fail in the same way. One model may find a useful route into a problem while another may improve the reasoning, spot a better pattern, or test a different interpretation.

Multi-LLM AB-MCTS is built around that practical observation. The models exchange and refine suggestions as the search unfolds. Instead of treating each model as a standalone answer generator, the system treats them as contributors inside a shared process.

The source article compares the setup to a human team, but the important point is more technical: the algorithm can shift work among models while the task is still in progress. That makes the model choice adaptive rather than fixed at the start.

This is different from simply asking several models the same question and picking the best-looking reply. AB-MCTS structures the exploration itself. It decides whether to deepen a current path, widen the search, or call on a different model for the next step.
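A rough sketch of that loop might look like the following. It is a simplified illustration under stated assumptions, not the published algorithm: `fake_model` stands in for a real LLM call, the model names and skill values are invented, and the deepen-or-widen choice is reduced to a fixed coin flip so the example can focus on adaptive model choice.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    score: float          # quality of this candidate, in [0, 1]
    model: str            # which model produced it
    depth: int
    children: list = field(default_factory=list)

def fake_model(name, parent_score, rng):
    """Hypothetical stand-in for an LLM call; returns a score in [0, 1]."""
    skill = {"model_a": 0.05, "model_b": 0.15}[name]  # invented values
    return min(1.0, max(0.0, parent_score + rng.uniform(-0.1, 0.1) + skill))

def search(budget=30, seed=0):
    rng = random.Random(seed)
    models = ["model_a", "model_b"]
    wins = {m: [1, 1] for m in models}  # Beta(1, 1) prior per model
    root = Node(score=0.0, model="root", depth=0)
    nodes = [root]
    for _ in range(budget):
        # Which model? One Thompson-sampling draw per model, highest wins.
        model = max(models, key=lambda m: rng.betavariate(*wins[m]))
        # Deepen or widen? Simplified here: 30% of the time start again
        # from the root, otherwise refine the best candidate so far.
        if rng.random() < 0.3:
            parent = root
        else:
            parent = max(nodes, key=lambda n: n.score)
        child = Node(fake_model(model, parent.score, rng), model, parent.depth + 1)
        parent.children.append(child)
        nodes.append(child)
        # Credit the model only if it improved on its parent.
        wins[model][0 if child.score > parent.score else 1] += 1
    return nodes, wins

if __name__ == "__main__":
    nodes, wins = search()
    best = max(nodes, key=lambda n: n.score)
    print(best.model, round(best.score, 3), wins)
```

The point of the sketch is that model choice is revisited on every step. A model that keeps improving its parent candidates is sampled more often, while the others still get occasional turns.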

What ARC-AGI-2 Tests Showed

Sakana AI tested the method on ARC-AGI-2, described in the source as a challenging benchmark. In those tests, Multi-LLM AB-MCTS solved more problems than Single-LLM AB-MCTS.

The most notable result is not just that the combined system did better overall. In several cases, the correct answer appeared only when different models were used together. That suggests the benefit came from interaction across models, not merely from giving one model more chances.

Still, the results also show the limits of the current approach. With unlimited guesses, the system finds a correct answer for about 30 percent of the problems. Under the official ARC-AGI-2 benchmark setting, where submissions are usually limited to one or two answers, the success rate drops significantly.

That gap highlights a central problem for this kind of system. Generating useful candidate answers is only part of the work. The system also needs a reliable way to identify which candidate should be submitted when only one or two tries are allowed.
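The gap can be made concrete with a small calculation. The sketch below uses invented toy data, not Sakana AI's results: each task has a list of candidates, and each candidate is a pair of a hypothetical selector score and a flag saying whether it is correct.

```python
def coverage(results):
    """Share of tasks where any candidate is correct (unlimited guesses)."""
    return sum(any(ok for _, ok in cands) for cands in results) / len(results)

def top_k_accuracy(results, k=2):
    """Share of tasks where one of the k highest-scored candidates is correct."""
    hits = 0
    for cands in results:
        ranked = sorted(cands, key=lambda pair: pair[0], reverse=True)[:k]
        hits += any(ok for _, ok in ranked)
    return hits / len(results)

# Toy data: (selector_score, is_correct) per candidate, per task.
TASKS = [
    [(0.9, True), (0.4, False), (0.2, False)],   # correct answer ranked first
    [(0.8, False), (0.7, False), (0.3, True)],   # correct answer ranked third
    [(0.6, False), (0.5, False)],                # never solved
    [(0.5, False), (0.45, True), (0.1, False)],  # correct answer ranked second
]

if __name__ == "__main__":
    print(coverage(TASKS))            # 0.75
    print(top_k_accuracy(TASKS, 2))   # 0.5
    print(top_k_accuracy(TASKS, 1))   # 0.25
```

In this toy case the search finds a correct answer for three of four tasks, but only two survive a two-submission limit and one survives a single submission. The difference is entirely down to how well the candidates are ranked.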

The Next Bottleneck Is Selection

Sakana AI plans to work on methods that automatically choose the strongest suggestions. One idea mentioned in the source is to use another AI model to evaluate the options before the final answer is selected.

The company is also considering combinations with systems where AI models discuss solutions with each other. That could make the process more deliberative, though the source does not provide performance results for that direction.

The challenge is easy to understand. A collaborative model system may produce a richer set of possible answers, but a benchmark may reward only the final choice. If the selection mechanism is weak, the system can still miss even when the correct answer appeared somewhere in the search.

For developers, that means AB-MCTS is not just about model orchestration. It is also about ranking, filtering, and deciding under constraints. Those steps may become as important as the individual model outputs.
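As a minimal sketch of what such a selection step could look like, the function below deduplicates candidate answers, ranks them by how often they were proposed, and uses an optional scoring function as a tie-breaker. The function name, the voting heuristic, and the `judge` callable are assumptions made for illustration; the source does not say which selection method Sakana AI will use.

```python
from collections import Counter

def select_submissions(candidates, judge=None, k=2):
    """Pick up to k distinct answers to submit.

    candidates: hashable answers collected during the search
                (for grid puzzles, tuples of tuples would work).
    judge:      optional callable returning a score for an answer,
                standing in for an evaluator model (hypothetical).
    Answers proposed more often rank first; the judge breaks ties.
    """
    votes = Counter(candidates)

    def rank_key(answer):
        return (votes[answer], judge(answer) if judge else 0.0)

    ranked = sorted(votes, key=rank_key, reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Frequency alone decides here: "A" appears 3 times, "B" twice.
    print(select_submissions(["A", "B", "A", "C", "B", "A"]))

    # With one vote each, the hypothetical judge scores decide.
    scores = {"A": 0.1, "B": 0.9, "C": 0.5}
    print(select_submissions(["A", "B", "C"], judge=scores.get))
```

Majority voting is only one possible filter. It helps when correct answers tend to recur across branches, and it does nothing when the correct answer appears once among many distinct wrong ones, which is where a stronger evaluator would matter.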

TreeQuest Fits Sakana AI’s Broader Direction

Sakana AI has released the algorithm as open-source software under the name TreeQuest. That makes it available for developers who want to apply the method to their own problems.

The release also fits a broader pattern in Sakana AI’s recent work. The source describes a busy summer for the Tokyo startup, including several projects built around iteration, adaptation, and modular agent behavior.

Darwin-Gödel Machine is an agent that rewrites its own Python code in rapid genetic cycles. Dozens of variants are created and tested on SWE-bench and Polyglot, with only the strongest performers kept. After 80 rounds, SWE-bench accuracy rose from 20% to 50%, while Polyglot scores more than doubled to 30.7%.
ALE agent placed in the top 21 at a live AtCoder Heuristic Contest in June, outperforming over 1,000 human participants. It uses Google’s Gemini 2.5 Pro with optimization methods including simulated annealing, beam search, and tabu lists.
Transformer², a study published in January, focused on continual learning in large language models.

Together with AB-MCTS, those projects point to a consistent research direction. Sakana AI is exploring systems that iterate, evolve, select, and combine approaches instead of relying on a single model pass.

For now, Multi-LLM AB-MCTS is promising but incomplete. Its strongest message is that collaboration among large language models can produce answers that isolated models may not find. Its main unresolved question is whether future selection methods can turn more of those discovered possibilities into correct final submissions.