Why ARC-AGI-2 is exposing limits in leading AI models

ARC-AGI-2, a new benchmark from the Arc Prize Foundation, is proving difficult for many leading AI models. Human panels averaged 60%, while several prominent models scored close to 1%, raising fresh questions about efficient reasoning and general intelligence.

A new artificial intelligence benchmark is putting pressure on some of the most prominent AI models. ARC-AGI-2, created by the Arc Prize Foundation, is designed to test whether AI systems can adapt to unfamiliar visual problems rather than lean on memorized patterns or heavy computing power.

The early results are stark. Several leading models are scoring around 1%, while human panels reached an average of 60% on the same test.

What ARC-AGI-2 Is Testing

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced ARC-AGI-2 in a blog post on Monday. The test is the foundation’s latest attempt to measure general intelligence in AI models.

The benchmark uses puzzle-like tasks built from grids of different-colored squares. An AI model must look at examples, identify the visual pattern, and produce the correct answer grid.

The important point is not just whether a model can solve a puzzle. The test is meant to show whether an AI system can handle a new kind of problem it has not already seen during training.

That distinction matters for artificial general intelligence. The Arc Prize Foundation’s tests focus on whether an AI system can efficiently acquire new skills outside the data it was trained on. In plain terms, ARC-AGI-2 asks whether a model can figure out a rule in the moment, then apply it correctly.
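To make the task format concrete, here is a minimal Python sketch. It assumes a grid can be represented as a list of rows of integer color codes and that an answer counts only if it matches the expected grid exactly. The toy task, the mirror rule, and the names fits_examples and is_correct are hypothetical illustrations, not material from the benchmark itself.

```python
from typing import Callable, List, Tuple

# Assumption: a grid is a list of rows, and each integer stands for a color.
Grid = List[List[int]]

# Hypothetical toy task (not from ARC-AGI-2): each output is the input
# mirrored left to right.
train_pairs: List[Tuple[Grid, Grid]] = [
    ([[1, 0, 0], [2, 0, 0]], [[0, 0, 1], [0, 0, 2]]),
    ([[3, 4], [0, 5]], [[4, 3], [5, 0]]),
]
test_input: Grid = [[6, 0, 7, 0]]
expected_output: Grid = [[0, 7, 0, 6]]


def mirror_left_right(grid: Grid) -> Grid:
    """Candidate rule: reverse every row of the grid."""
    return [list(reversed(row)) for row in grid]


def identity(grid: Grid) -> Grid:
    """A wrong candidate rule, kept for contrast: return a copy of the grid."""
    return [list(row) for row in grid]


def fits_examples(rule: Callable[[Grid], Grid],
                  pairs: List[Tuple[Grid, Grid]]) -> bool:
    """Return True if the rule reproduces every example output."""
    return all(rule(inp) == out for inp, out in pairs)


def is_correct(predicted: Grid, expected: Grid) -> bool:
    """Assumed scoring: the predicted grid must match exactly."""
    return predicted == expected


if __name__ == "__main__":
    for rule in (identity, mirror_left_right):
        if fits_examples(rule, train_pairs):
            answer = rule(test_input)
            print(rule.__name__, "fits; correct:",
                  is_correct(answer, expected_output))
        else:
            print(rule.__name__, "does not fit the examples")
```

The hard part, which this sketch skips, is finding the rule in the first place. Here the candidate rules are written by hand, whereas a model taking the test has to infer one from the examples alone.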

How Leading Models Performed

So far, ARC-AGI-2 has been difficult for most models on the Arc Prize leaderboard. “Reasoning” AI models such as OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3%.

Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

The human baseline looks very different. The Arc Prize Foundation had over 400 people take ARC-AGI-2. On average, “panels” of those people got 60% of the test’s questions right.

Those results do not simply show that the test is hard. They show a wide gap between current model performance and human performance on a task built around fast adaptation to unfamiliar visual patterns.

Why Efficiency Became Central

ARC-AGI-2 was created partly in response to a weakness in the first version of the test, ARC-AGI-1. Chollet said the new benchmark is a better measure of an AI model’s actual intelligence because it prevents models from relying on “brute force” through extensive computing power.

That issue had already become visible with ARC-AGI-1. The earlier benchmark was unbeaten for roughly five years until December 2024, when OpenAI unveiled its advanced reasoning model, o3. That model outperformed all other AI models and matched human performance on the evaluation.

But those gains were expensive, and they did not carry over to the new test. The o3 (low) configuration, which reached new heights on ARC-AGI-1 with a score of 75.7%, scored only 4% on ARC-AGI-2 while using $200 worth of computing power per task.

ARC-AGI-2 introduces efficiency as a new metric. It also requires models to interpret patterns on the fly instead of depending on memorization.

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

What The Benchmark Says About AI Progress

ARC-AGI-2 arrives as many people in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. The concern is that existing tests may no longer reveal enough about the traits associated with artificial general intelligence.

Hugging Face’s co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure key traits of artificial general intelligence, including creativity.

ARC-AGI-2 fits into that debate because it shifts attention away from raw scores alone. A model that can eventually solve a task with enough compute may still be far from showing the kind of flexible, efficient learning the benchmark is designed to probe.

The Arc Prize Foundation is also tying the benchmark to a competition. Alongside ARC-AGI-2, it announced the Arc Prize 2025 contest, which challenges developers to reach 85% accuracy on ARC-AGI-2 while spending only $0.42 per task.
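The size of that gap is easier to see with a little arithmetic. The Python sketch below uses only the figures reported in this article (85% accuracy and $0.42 per task for the contest target, 4% and $200 per task for o3 (low)). The Result class and the meets_target function are hypothetical names for illustration, and the pass rule is an assumed reading of the contest goal, not the foundation's official scoring method.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    accuracy: float       # fraction of tasks solved, between 0.0 and 1.0
    cost_per_task: float  # US dollars of compute per task


# Contest target as reported in the article.
TARGET_ACCURACY = 0.85
TARGET_COST_PER_TASK = 0.42


def meets_target(result: Result) -> bool:
    """Assumed rule: both the accuracy and the cost condition must hold."""
    return (result.accuracy >= TARGET_ACCURACY
            and result.cost_per_task <= TARGET_COST_PER_TASK)


# Figures reported in the article for o3 (low) on ARC-AGI-2.
o3_low = Result(name="o3 (low)", accuracy=0.04, cost_per_task=200.0)

# How many times over the cost target, and how far short on accuracy.
cost_ratio = o3_low.cost_per_task / TARGET_COST_PER_TASK
accuracy_gap = TARGET_ACCURACY - o3_low.accuracy

if __name__ == "__main__":
    print(f"{o3_low.name} meets target: {meets_target(o3_low)}")
    print(f"cost is about {cost_ratio:.0f}x the target")
    print(f"accuracy is {accuracy_gap * 100:.0f} points short")
```

On these numbers, o3 (low) spends roughly 476 times the contest's per-task budget and falls 81 percentage points short on accuracy.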

That target captures the central message of the new test. The next stage of AI evaluation is not only about whether a model can find the right answer. It is also about how quickly, cheaply, and adaptively it can get there.