Why ARC-AGI gains are reshaping the AI benchmark debate

Poetiq reports major gains on ARC-AGI benchmarks, including 75 percent accuracy on the public ARC-AGI-2 test dataset using OpenAI's GPT-5.2 X-High. The results show how AI systems are turning abstract reasoning tests into targets for runtime adaptation, code generation and iterative optimization.

ARC-AGI was built to test whether AI systems could learn new abstract tasks instead of simply repeating patterns from training data. Recent results from Poetiq suggest that even this once-resistant benchmark is now being pushed hard by modern reasoning systems.

The change does not settle the question of artificial general intelligence. It does show that AI labs are getting better at converting difficult benchmarks into engineering problems, then attacking them with specialized systems that adapt while they work.

Poetiq pushes ARC-AGI-2 beyond earlier results

In an update from December 25, 2025, Poetiq said it had reached 75 percent accuracy on the public ARC-AGI-2 test dataset using OpenAI's GPT-5.2 X-High. The source describes that score as roughly 15 percentage points above the previous best and well beyond human-level performance.

The reported cost was under $8 per task, which the source says was significantly cheaper than before. Poetiq said the X-High variant may cost less per task than the High variant because the model arrives at correct answers faster.

The company attributed the result to changes in both prompt and code, which it describes as the system's "reasoning strategy." Poetiq also said GPT-5.2 did not receive special training or model-specific adjustments for the benchmark, and the company plans to release the code soon.

The underlying method is not a simple one-shot answer. Poetiq's solver guides GPT-5.2 to write code for each individual task, runs that code, checks whether it works, and fixes errors. It also combines multiple independent runs to make the final answer more reliable.
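
The combination step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not Poetiq's code, which the company has not yet released. Each independent run is represented by a hand-written function standing in for model-generated code, and the helper names passes_examples and vote are hypothetical.

```python
from collections import Counter

def passes_examples(program, examples):
    """Return True if the program reproduces every demonstration output."""
    try:
        return all(program(inp) == out for inp, out in examples)
    except Exception:
        # Generated code may crash; a crash counts as a failed check.
        return False

def vote(programs, examples, test_input):
    """Keep programs that pass the examples, then majority-vote their answers."""
    answers = []
    for program in programs:
        if passes_examples(program, examples):
            # Nested tuples are hashable, so identical grids can be counted.
            answers.append(tuple(map(tuple, program(test_input))))
    if not answers:
        return None
    winner, _ = Counter(answers).most_common(1)[0]
    return [list(row) for row in winner]

# Stand-ins for code produced by three independent runs (hypothetical).
def flip_rows(g):
    return [row[::-1] for row in g]

def flip_rows_again(g):
    return [list(reversed(row)) for row in g]

def identity(g):
    return g

examples = [([[1, 0], [2, 0]], [[0, 1], [0, 2]])]
result = vote([flip_rows, flip_rows_again, identity], examples, [[3, 0, 0]])
print(result)  # [[0, 0, 3]]
```

In this toy case the identity candidate fails the demonstration pair and is discarded, so the two agreeing candidates decide the answer.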

Why ARC-AGI mattered in the first place

The benchmark began as the "Abstraction and Reasoning Corpus" and was later renamed ARC-AGI. François Chollet introduced ARC in 2019 as a way to measure "skill acquisition efficiency" rather than memorization.

That aim made ARC unusual. Many AI benchmarks can be improved by training larger models on more data, but ARC was designed around colorful grid puzzles that require a system to infer a rule from examples and apply it to a new case.
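
To make the puzzle format concrete, here is a small Python sketch of an ARC-style task. The layout, with "train" demonstration pairs and a "test" input, mirrors the JSON structure of the public ARC task files, where each cell is an integer from 0 to 9 standing for a color. The specific grids and the mirror rule are invented for illustration and are not taken from the benchmark.

```python
# A toy ARC-style task (invented grids; not from the real dataset).
task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0]], "output": [[0, 3, 3]]},
    ],
    "test": [{"input": [[0, 4], [5, 0]]}],
}

def mirror(grid):
    """Candidate rule: reflect each row left to right."""
    return [row[::-1] for row in grid]

# The rule is only credible if it explains every demonstration pair.
explains_all = all(mirror(p["input"]) == p["output"] for p in task["train"])

# Apply the inferred rule to the unseen test input.
prediction = mirror(task["test"][0]["input"])
print(explains_all, prediction)  # True [[4, 0], [0, 5]]
```

Real tasks are harder because the rule is not given: a solver has to discover something like mirror from a handful of pairs, and the rule differs from task to task.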

For years, researchers had limited success on these tasks. While language models performed strongly on other tests, ARC remained difficult enough that some treated it as a "North Star" for AGI research and others saw it as evidence that scaling alone had limits.

The source describes a major shift in December 2024, when OpenAI's o3-preview scored over 75 percent on ARC-AGI-1. From that point, the benchmark increasingly looked less like an untouched measure of abstraction and more like a target for specialized reasoning, search and reinforcement learning.

Optimization is changing what the scores mean

Poetiq's earlier results already suggested that ARC-AGI-1 was effectively saturated. The company said systems built on models from OpenAI and Google had maxed out performance on the first dataset, while also beating the human average of 60 percent on the harder ARC-AGI-2 dataset.

The approach combines advanced language models, including Gemini 3 and GPT-5.1, with open-source models inside a custom architecture. According to Poetiq, the system works through an iterative loop:

  • It generates possible solutions.
  • It evaluates feedback from those attempts.
  • It refines answers through a self-audit before producing the final result.
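
The loop above can be sketched in Python under heavy assumptions. The propose function is a hypothetical stand-in for a language model call: here it simply walks through a fixed list of candidate rules and ignores the feedback it receives, whereas a real system would condition its next attempt on that feedback. Nothing in this sketch reflects Poetiq's actual architecture.

```python
def evaluate(program, examples):
    """Return a list of feedback strings; an empty list means all examples passed."""
    feedback = []
    for i, (inp, expected) in enumerate(examples):
        try:
            got = program(inp)
        except Exception as exc:
            feedback.append(f"example {i}: raised {exc!r}")
            continue
        if got != expected:
            feedback.append(f"example {i}: expected {expected}, got {got}")
    return feedback

def solve(propose, examples, max_rounds=5):
    """Generate, evaluate, and refine until a candidate passes or rounds run out."""
    feedback = []
    for round_no in range(max_rounds):
        program = propose(round_no, feedback)   # generate a candidate
        feedback = evaluate(program, examples)  # collect feedback on it
        if not feedback:                        # nothing left to fix
            return program, round_no + 1
    return None, max_rounds

# Hypothetical proposer: tries canned candidates in order.
CANDIDATES = [
    lambda g: g,                         # identity
    lambda g: g[::-1],                   # flip top to bottom
    lambda g: [row[::-1] for row in g],  # flip left to right
]

def propose(round_no, feedback):
    return CANDIDATES[round_no % len(CANDIDATES)]

examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
program, rounds = solve(propose, examples)
print(rounds, program([[5, 6]]))  # 3 [[6, 5]]
```

The first two candidates fail the check and produce feedback; the third passes, so the loop stops after three rounds.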

That loop is important because it reframes abstraction as an active process. Instead of asking a model to answer from stored knowledge alone, the system can generate code, test ideas, revise mistakes and repeat the process.

Efficiency is improving at the same time. Poetiq said its "Poetiq (GPT-OSS-b)" system, based on GPT-OSS-120B, reaches over 40 percent accuracy on ARC-AGI-1 for less than a cent per task. The source also points to the non-LLM "Tiny Recursive Model" as another sign that ARC performance is no longer only about massive compute.

Public benchmarks still carry a contamination risk

The strongest reported numbers apply to public datasets, not the semi-private sets held back by ARC administrators. That distinction matters because public benchmarks can appear in the training data used for large models.

The source identifies this as "data contamination." If a model has indirectly seen benchmark material during training, a high score may not prove that it can generalize to truly new tasks.

Poetiq's own analysis says many underlying LLMs perform much worse when moving from public evaluation sets to semi-private ones. The company expects a similar drop for its own systems on ARC-AGI-1 for that reason.

ARC-AGI-2 may be less vulnerable to the same problem. Poetiq describes the ARC-AGI-2 sets as "more tightly calibrated" and says its system was never trained on ARC-AGI-2 tasks, while noting that the foundation models it uses might have been.

A benchmark can fall and still matter

Chollet views recent progress as evidence of a strategic change in AI development. He described results from reasoning models like o3 as "a surprising and important step-function increase in AI capabilities," while arguing that the field has moved into test-time adaptation.

In this framing, models are no longer just static responders. They can adapt at runtime through processes related to program synthesis and chain-of-thought reasoning, reworking their approach for the specific problem in front of them.

That does not mean ARC success equals AGI. The source says Chollet still sees solving ARC as a necessary step toward AGI, but not AGI itself. Current models still fail basic tasks and lack deep understanding of the world.

The more grounded conclusion is that ARC-AGI did its job. It forced AI research to focus on reasoning and adaptation. If ARC-AGI-1 is effectively solved and ARC-AGI-2 is now under pressure, that is not simply a failure of the benchmark. It is also evidence that the benchmark successfully redirected effort toward harder forms of problem solving.