The Decoder, June 21, 2025

How Sakana AI's ALE agent reached 21st at AtCoder

Sakana AI's ALE agent placed 21st in the 47th AtCoder Heuristic Contest against more than 1,000 human programmers. The result shows how an AI coding agent can compete on hard optimization tasks by combining expert methods with large-scale search.

Sakana AI has shown a new kind of AI coding performance in a live programming contest. Its ALE agent finished in 21st place at the 47th AtCoder Heuristic Contest, competing against more than 1,000 human programmers on complex optimization problems.

The result matters because these contests do not reward a single correct answer. They reward progressively better solutions to problems for which no efficient exact method is known, which makes them closer to many difficult planning and scheduling problems found in industry.

Why the AtCoder result stands out

AtCoder runs Japanese programming competitions built around mathematical and algorithmic challenges. In this case, the contest focused on heuristic optimization, where participants write code that tries to produce the strongest possible score under difficult constraints.

The source describes these as "NP-hard" problems, meaning no algorithm is known that solves them exactly in a practical amount of time as the inputs grow. That makes them especially demanding for both humans and AI systems. Instead of solving a puzzle once, contestants must improve their approach repeatedly, often testing alternatives and refining code over time.

Sakana AI's ALE agent reached 21st place in the 47th AtCoder Heuristic Contest. That ranking put it ahead of many human experts and showed that an AI agent can do more than answer coding questions in isolation. It can participate in a competitive setting where progress depends on exploration, scoring, and revision.

The kinds of problems involved are not merely academic. The same broad class of optimization challenges appears in delivery route planning, work shift organization, factory production management, and power grid balancing. In those settings, even a slightly better solution can matter, because small gains add up across a large and complex system.

How ALE-Bench frames the challenge

The contest performance builds on ALE-Bench, which Sakana AI describes as the first benchmark for score-based algorithmic programming. The benchmark uses 40 difficult optimization problems taken from past AtCoder contests.

This is different from traditional coding tests that mark an answer as right or wrong. ALE-Bench is built around continuous improvement. The goal is not just to produce working code, but to keep raising the score through better algorithms, better parameter choices, and better search.

That distinction is important for understanding why the ALE agent is notable. A system that can pass a conventional test may still struggle when the task requires many cycles of experimentation. ALE-Bench instead asks whether an AI can keep improving a solution over an extended process.

Sakana AI released ALE-Bench as a Python library. It includes a built-in "code sandbox" for safe testing and supports C++, Python, and Rust. The framework runs on standard Amazon cloud infrastructure, and Sakana AI developed it with AtCoder Inc. The data from the 40 competition problems is available on Hugging Face, while the code is publicly accessible on GitHub.

What powers the ALE agent

The ALE agent runs on Google's Gemini 2.5 Pro and combines two main approaches. One part is expert knowledge: the system is instructed with established solution techniques used in hard optimization problems.

One example is simulated annealing. In plain terms, that technique tries random changes to a solution and can sometimes accept a worse result in order to avoid getting trapped in a poor local outcome. For score-based contests, that kind of controlled risk can be useful because the best path forward is not always obvious from the current result.
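The accept-a-worse-result rule is easier to see in code. The following is a minimal sketch of simulated annealing on a toy route-ordering problem, not the ALE agent's implementation. The function names, the geometric cooling schedule, the temperature range, and the 20 random points are all assumptions made for illustration.

```python
import math
import random


def simulated_annealing(initial, score, neighbor, steps=20_000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Maximize `score` with random local changes (illustrative helper)."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for step in range(steps):
        # Assumed geometric cooling: temperature falls from t_start toward t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Improvements are always accepted. A worse candidate is accepted
        # with probability exp(delta / t), which shrinks as t cools.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


def tour_length(order, points):
    """Length of the closed tour that visits points in the given order."""
    n = len(order)
    return sum(math.dist(points[order[i]], points[order[(i + 1) % n]])
               for i in range(n))


def reverse_segment(order, rng):
    """Random change: reverse one contiguous segment of the tour."""
    i, j = sorted(rng.sample(range(len(order)), 2))
    return order[:i] + order[i:j + 1][::-1] + order[j + 1:]


# Toy instance: 20 random points in the unit square (made-up data).
point_rng = random.Random(1)
points = [(point_rng.random(), point_rng.random()) for _ in range(20)]
start = list(range(20))

# The score is the negative tour length, so a higher score means a shorter tour.
best, best_score = simulated_annealing(
    start, lambda order: -tour_length(order, points), reverse_segment)
print("start length:", round(tour_length(start, points), 3))
print("best length: ", round(-best_score, 3))
```

Early on, when the temperature is high, the loop accepts many worsening moves and wanders widely. As the temperature falls, it behaves more and more like plain hill climbing.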

The second part is systematic search. The agent uses "best-first search," a method that selects the most promising partial solution and develops it further. It extends that approach with a "beam search"-like process, following 30 solution paths at the same time.
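A small sketch can show what following several solution paths at once looks like. The default width of 30 matches the figure quoted above; everything else is an assumption for illustration. The toy problem is a small knapsack with made-up items, where `expand` adds one more item to a partial selection. In the real agent, expansion would presumably mean proposing revised programs and scoring them, which this sketch does not attempt.

```python
import heapq


def beam_search(root, expand, score, beam_width=30, rounds=10):
    """Keep the `beam_width` best candidates each round and extend them."""
    beam = [root]
    best = root
    for _ in range(rounds):
        children = [child for node in beam for child in expand(node)]
        if not children:
            break
        # Only the highest-scoring children survive into the next round.
        beam = heapq.nlargest(beam_width, children, key=score)
        if score(beam[0]) > score(best):
            best = beam[0]
    return best


# Toy knapsack (made-up data): each item is (weight, value).
ITEMS = [(12, 4), (2, 2), (1, 1), (1, 2), (4, 10)]
CAPACITY = 15


def value(selection):
    """Score of a partial solution: total value of the chosen items."""
    return sum(ITEMS[i][1] for i in selection)


def expand(selection):
    """Add one higher-indexed item that still fits in the knapsack."""
    weight = sum(ITEMS[i][0] for i in selection)
    first = selection[-1] + 1 if selection else 0
    return [selection + (i,) for i in range(first, len(ITEMS))
            if weight + ITEMS[i][0] <= CAPACITY]


wide = beam_search((), expand, value)                   # 30 parallel paths
narrow = beam_search((), expand, value, beam_width=1)   # one greedy path
print(wide, value(wide))      # (1, 2, 3, 4) 15
print(narrow, value(narrow))  # (4,) 10
```

With a width of 1 the search commits to the single most valuable item and runs into a dead end. A wider beam keeps weaker-looking partial solutions alive long enough for the best combination to emerge.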

The agent also uses a "taboo search" mechanism (more commonly spelled "tabu search"). That lets it remember solutions it has already tried, reducing wasted effort from repeating the same work. Together, these pieces give the system a way to explore broadly while still using the score signal to decide where to focus.
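The memory idea can be sketched as a set of fingerprints of programs that have already been tried. This is a simplified illustration, not the agent's mechanism: the class name, the whitespace normalization, and the use of SHA-256 are assumptions. Classical tabu search usually keeps a bounded list of recent moves rather than an unbounded set.

```python
import hashlib


class SeenSolutions:
    """Remembers candidate programs so duplicates are not re-evaluated."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def fingerprint(source: str) -> str:
        # Drop trailing whitespace and blank lines so trivially
        # reformatted copies of the same code share one fingerprint.
        lines = [line.rstrip() for line in source.splitlines()]
        normalized = "\n".join(line for line in lines if line)
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def is_new(self, source: str) -> bool:
        """Return True the first time a program is seen, False afterwards."""
        key = self.fingerprint(source)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


memory = SeenSolutions()
print(memory.is_new("x = 1\n"))       # True: first time this code is seen
print(memory.is_new("x = 1   \n\n"))  # False: same code, extra whitespace
print(memory.is_new("x = 2\n"))       # True: genuinely different code
```

A search loop would call `is_new` before spending time running and scoring a candidate, and skip any candidate for which it returns False.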

The scores show the value of the full system

Sakana AI compared several models under identical conditions. The best model listed, o4-mini-high, reached 1,411 points with sequential improvements. GPT-4.1 mini scored 1,016 points, DeepSeek-R1 reached 1,150 points, and Gemini 2.5 Pro achieved 1,198 points.

The complete ALE agent performed better than those model-only results. It reached 1,879 points and landed in the top 6.8 percent. On one individual problem, the agent scored 2,880 points, a result that would have placed 5th in the original competition.

Those numbers suggest that the agent architecture, not only the underlying model, is central to the result. The system benefits from combining model reasoning with search, memory, and repeated testing. In this setting, the workflow around the model appears as important as the model itself.
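The size of the gap follows from simple subtraction over the figures quoted above. The snippet below only restates that arithmetic; it assumes the agent's score and the model-only scores are on the same scale and directly comparable.

```python
# Scores as quoted in the article (points).
model_only = {
    "o4-mini-high": 1411,
    "GPT-4.1 mini": 1016,
    "DeepSeek-R1": 1150,
    "Gemini 2.5 Pro": 1198,
}
agent_score = 1879  # full ALE agent, which runs on Gemini 2.5 Pro

# Point gap between the full agent and each model-only run.
gaps = {name: agent_score - points for name, points in model_only.items()}
for name, gap in gaps.items():
    print(f"{name}: +{gap} points")
# o4-mini-high: +468 points
# GPT-4.1 mini: +863 points
# DeepSeek-R1: +729 points
# Gemini 2.5 Pro: +681 points
```

The last line is the most telling comparison, since the agent is built on Gemini 2.5 Pro. The search and memory machinery around the model accounts for a 681-point difference over the same model improving its answer sequentially.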

Why iteration gives AI an edge

The source highlights a practical difference between human contestants and the AI agent: pace. During a four-hour competition, human participants might test a dozen different solutions. Sakana's AI can cycle through around 100 versions in the same timeframe.

In practice, the ALE agent produced hundreds or thousands of possible solutions. That level of output is not something a human contestant can match manually. For optimization tasks, where many attempts may be needed before a strong solution appears, this ability to explore quickly becomes a major advantage.

That does not mean the contest becomes simple for AI. The problems remain hard, and the agent still needs guidance about which strategies are worth testing. But the result shows how AI coding agents can be useful when the task rewards structured trial, measured feedback, and persistent refinement.

Sakana AI's ALE agent is therefore a signal about where AI coding systems may be heading. The important shift is from writing one answer to managing a search process: generate code, test it, score it, remember what failed, and keep improving. For hard optimization work, that process is the product.