The Decoder, July 20, 2025

Why ARC-AGI-3 makes simple reasoning hard for AI

ARC-AGI-3 is a new benchmark from François Chollet and his team that tests whether AI systems can learn in unfamiliar situations. Humans can solve the preview games quickly, while current AI systems have not reliably beaten them.

ARC-AGI-3 puts a sharper spotlight on a stubborn gap in artificial intelligence: the ability to enter a new situation, infer what matters, and adapt without being told the rules. The benchmark, released by AI researcher François Chollet and his team, is designed to test general intelligence by removing familiar shortcuts and forcing systems to learn from scratch.

The result, at least in the current Developer Preview, is blunt. Humans can handle the available challenges quickly and easily, while AI systems still fall short on the same tasks.

A benchmark built around unfamiliar problems

ARC-AGI-3 is meant to evaluate whether AI systems can learn on their own in situations they have not seen before. The benchmark avoids language, trivia, and cultural symbols, which means a system cannot lean on memorized facts or broad background knowledge to get through the test.

Instead, the tasks are built around what the creators call "core knowledge priors." These include basic cognitive abilities such as object permanence and causality. In plain terms, the benchmark is interested in whether an AI agent can work out how objects behave, what actions change a situation, and what sequence of moves might lead to success.

That focus matters because many AI evaluations can reward systems for recognizing patterns already present in training data or for using language fluently. ARC-AGI-3 narrows the target. It asks whether a system can face a genuinely new environment and figure out how it works without hints.

Interactive games replace static puzzles

The major change in ARC-AGI-3 is its interactive format. Rather than presenting static problems, the preview uses mini-games inside a grid world. The agent has to act, observe what happens, update its understanding, and try again.

That makes the benchmark less like answering a question and more like learning a small environment. To win, an AI system must infer both the rules and the objective. It must learn through trial and error, then turn that learning into a plan.
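The act, observe, update cycle can be illustrated with a toy example. The sketch below is not the ARC-AGI-3 API: `ToyGridGame`, its `reset` and `step` methods, and `learn_action_effects` are hypothetical names invented for illustration. It assumes a small grid where the agent is given numbered actions without being told what they do, and has to work out each action's effect by trying it.

```python
import random


class ToyGridGame:
    """Hypothetical 5x5 grid world for illustration; not the ARC-AGI-3 API."""

    def __init__(self, size=5, seed=0):
        self.size = size
        moves = [(0, -1), (0, 1), (-1, 0), (1, 0)]
        random.Random(seed).shuffle(moves)
        # Which action id maps to which move is hidden from the agent.
        self._effects = dict(enumerate(moves))
        self.pos = self.reset()

    def reset(self):
        """Put the agent back in the centre cell and return its position."""
        self.pos = (self.size // 2, self.size // 2)
        return self.pos

    def step(self, action):
        """Apply an action, clamp to the grid, and return the new position."""
        dx, dy = self._effects[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        return self.pos


def learn_action_effects(game, actions=range(4)):
    """Try every action once from the start cell and record what changed."""
    model = {}
    for action in actions:
        before = game.reset()        # observe the starting state
        after = game.step(action)    # act
        # Update the agent's model with the observed displacement.
        model[action] = (after[0] - before[0], after[1] - before[1])
    return model
```

Calling `learn_action_effects(ToyGridGame(seed=3))` returns a dictionary from action id to observed displacement, which the agent could then use for planning. The real preview games are far richer than this; the point is only that the rules are recovered from interaction rather than supplied up front.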

The developers frame this as closer to how humans learn. People explore, test possibilities, notice cause and effect, and adjust their behavior. ARC-AGI-3 is built to see whether AI agents can do something similar in a constrained setting where the needed knowledge is basic but the situation is new.

The project team describes the remaining gap in direct terms: "As long as that gap remains, we do not have AGI."

Humans still have the advantage

The Developer Preview includes three interactive test games. According to the creators and the leaderboard, humans can solve these games quickly and easily. AI systems, by contrast, have consistently failed to beat any of the games so far, apart from one entry whose origins are unknown.

That contrast is the central point of the preview. The games are not presented as language-heavy tasks or knowledge contests. They are meant to test basic adaptation. Yet that is exactly where current systems continue to struggle.

OpenAI researcher Zhiqing Sun claims on X that the new ChatGPT agent can already solve the first game. It is unclear, however, whether OpenAI's agent is the system holding the top position on the leaderboard.

That uncertainty leaves the broader picture unchanged. The benchmark's public preview shows that solving even compact, rule-discovery games remains difficult for AI agents when the task requires independent learning in an unfamiliar setup.

Why the format raises the bar

ARC-AGI-3 is not only asking whether an AI system can produce a correct answer. It is asking whether the system can discover what a correct answer even means inside a new environment. That difference raises the difficulty.

In these games, an agent must handle several steps at once:

Explore the grid world without prior explanation of the rules.
Infer cause and effect from the results of its own actions.
Identify the goal of the game through interaction.
Use trial and error to improve its behavior.
Plan actions that lead to success rather than isolated progress.

Those requirements follow from the benchmark's design. An agent that only reacts to its immediate surroundings is unlikely to succeed, and one that cannot revise its assumptions after failed attempts will also struggle. The format rewards flexible learning more than surface pattern matching.
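As a rough illustration of revising assumptions after failed attempts, the sketch below has an agent keep a list of guesses about where the goal might be, plan a route to each one, and discard every guess that does not produce a win. Everything here is an assumption made for the example: the function names `shortest_path` and `find_goal`, the idea that goals are single cells, and the `is_win` callback standing in for the environment's hidden win check. None of it comes from the benchmark itself.

```python
from collections import deque

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def shortest_path(start, target, size):
    """Breadth-first search on an open grid; returns a list of move names."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        cell, path = queue.popleft()
        if cell == target:
            return path
        for name, (dx, dy) in MOVES.items():
            nxt = (cell[0] + dx, cell[1] + dy)
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None  # target is not reachable


def find_goal(start, candidates, is_win, size):
    """Visit candidate goal cells in order, dropping each guess that fails."""
    pos, total_moves, rejected = start, 0, []
    for cell in candidates:
        path = shortest_path(pos, cell, size)
        if path is None:
            continue  # cannot test this guess, so move on to the next one
        total_moves += len(path)
        pos = cell
        if is_win(cell):  # hypothetical stand-in for the hidden win check
            return cell, total_moves, rejected
        rejected.append(cell)  # failed attempt: revise the assumption
    return None, total_moves, rejected
```

With a hidden goal at `(4, 0)`, the call `find_goal((0, 0), [(0, 4), (4, 4), (4, 0)], lambda c: c == (4, 0), size=5)` returns `((4, 0), 12, [(0, 4), (4, 4)])`: two guesses are tested and rejected before the third succeeds, at a cost of twelve moves. An agent that stuck with its first guess would never finish.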

That is why the comparison with humans is so revealing. The games are described as easy for people, which suggests the tasks draw on ordinary reasoning abilities rather than specialized expertise. ARC-AGI-3 therefore makes the gap visible in a way that is simple to understand: people can adapt, while today's AI systems mostly cannot.

Competition and what comes next

The preview is also tied to a sprint competition sponsored by Hugging Face. The competition offers a $10,000 prize, and participants have four weeks to build and submit the strongest agent using the provided API.

The benchmark is expected to expand significantly. By early 2026, the full version is supposed to include about a hundred different games, divided into public and private test sets. More information about the benchmark, participation, and the API is available at arcprize.org.

For now, ARC-AGI-3 gives researchers and developers a focused way to test a question that remains central to AI progress. Can an AI system enter a new environment, learn from interaction, and solve a problem without being handed the relevant background? In this preview, humans still do that easily. The latest AI models still have work to do.