TechCrunch AI, January 23, 2025

Why Humanity’s Last Exam is testing frontier AI

Humanity’s Last Exam is a new benchmark from the Center for AI Safety (CAIS) and Scale AI for testing frontier AI systems. In a preliminary study, no publicly available flagship AI system scored better than 10%.

A new AI benchmark is putting frontier AI systems under a sharper lens. Humanity’s Last Exam, released by the nonprofit Center for AI Safety (CAIS) and Scale AI, is designed to test how well leading systems handle difficult questions across several broad fields.

The early signal is blunt: in a preliminary study, not a single publicly available flagship AI system managed to score better than 10% on the benchmark.

A benchmark built to be difficult

Humanity’s Last Exam is not a narrow test focused on one subject area. It includes thousands of crowdsourced questions covering mathematics, humanities, and the natural sciences.

That range matters because frontier AI systems are often presented as general-purpose tools. A benchmark that moves across disciplines can expose whether a system is broadly capable, or whether its performance depends heavily on the type of task in front of it.

The benchmark also uses multiple question formats. Some questions incorporate diagrams and images, which makes the evaluation more demanding than a text-only test.

That design choice reflects a key challenge for AI evaluation: real reasoning tasks do not always arrive as clean lines of text. Some require a system to interpret visual information, connect it to a question, and produce an answer that fits the problem.

What the early results show

The preliminary study reported by CAIS and Scale AI found that no publicly available flagship AI system scored better than 10% on Humanity’s Last Exam.

That result does not mean every AI system failed in the same way, and the source does not provide model-by-model details. But it does show that the benchmark was difficult enough to hold even leading public systems to 10% or less.

For readers following AI progress, the result is a reminder that impressive performance in one setting does not automatically translate into strong performance everywhere. A system may appear capable on familiar tasks while still struggling with questions that combine depth, format variety, and subject breadth.

The benchmark’s name is deliberately ambitious, but the practical point is straightforward: CAIS and Scale AI have created a test on which no publicly available flagship AI system has yet scored better than 10%.

Why question format matters

Many AI evaluations are shaped by the kinds of inputs they use. Humanity’s Last Exam raises the difficulty by posing questions in multiple formats, some of which incorporate diagrams and images.

That detail is important because visual information can change the nature of a question. A model may need to understand a diagram before it can even begin solving the problem. It may also need to avoid treating an image as decoration when the image is central to the answer.

The inclusion of mathematics, humanities, and the natural sciences also means the benchmark is not measuring one narrow skill. It is testing whether frontier AI systems can move between different forms of knowledge and different styles of reasoning.

Mathematics can require exact problem solving and careful steps.
Humanities can require interpretation within a broader context.
Natural sciences can require applying concepts to structured problems.

The source does not say which areas were hardest. Still, the combination of subjects and formats helps explain why the benchmark may be more challenging than single-subject or text-only tests.

Opening the benchmark to researchers

CAIS and Scale AI say they plan to open up Humanity’s Last Exam to the research community. Their stated goal is to let researchers “dig deeper into the variations” and evaluate new AI models.

That next step is important for the benchmark’s usefulness. A hard test becomes more valuable when researchers can examine where systems succeed, where they fail, and how new models compare under the same conditions.
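As a rough illustration of what examining "where systems succeed, where they fail" could look like in practice, the sketch below breaks a set of graded answers down by subject. It is only a minimal example under stated assumptions: the record layout, the function names, and the sample values are hypothetical placeholders, not the benchmark's actual data format or any published results.

```python
from collections import defaultdict

# Hypothetical graded records: (subject, answered_correctly).
# Placeholder values for illustration only, not real benchmark data.
results = [
    ("mathematics", False),
    ("mathematics", True),
    ("humanities", False),
    ("natural sciences", False),
]


def accuracy_by_subject(records):
    """Return {subject: fraction of that subject's questions answered correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, is_correct in records:
        totals[subject] += 1
        correct[subject] += int(is_correct)
    return {subject: correct[subject] / totals[subject] for subject in totals}


def overall_accuracy(records):
    """Return the fraction of all questions answered correctly (0.0 if empty)."""
    if not records:
        return 0.0
    return sum(int(is_correct) for _, is_correct in records) / len(records)


if __name__ == "__main__":
    for subject, score in accuracy_by_subject(results).items():
        print(f"{subject}: {score:.0%}")
    print(f"overall: {overall_accuracy(results):.0%}")
```

Running the same breakdown on each new model's graded answers would let scores be compared subject by subject rather than only as a single headline number.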

Opening the benchmark can also help separate broad claims about AI capability from measured performance. If researchers can use the same test on new AI models, the results can become part of a more consistent discussion about progress.

For now, Humanity’s Last Exam stands as a demanding new reference point. It does not settle every question about frontier AI, but it gives researchers a tougher way to ask one of the central questions in the field: how much can today’s systems actually handle when the questions become broad, varied, and difficult?