Why Claude is forcing Anthropic to rethink its hiring test

Anthropic has redesigned its take-home test for performance engineers three times because newer Claude models kept outperforming candidates. The latest version moves away from realistic work tasks and toward unfamiliar constraints that better reveal human problem-solving.

Anthropic’s recruiting process is now running into a problem that says a lot about the state of AI-assisted software work: its own model, Claude, keeps getting too good at the hiring test.

According to a blog post by Tristan Hume, who leads Anthropic’s performance optimization team, the company has had to revamp its take-home test for performance engineers three times. Each time, a newer Claude model made the previous version less useful for evaluating candidates.

A hiring test built to mirror real engineering work

The original test was designed around a Python simulator for a fictional chip. Candidates received a working program and were asked to rewrite it so it would run faster.

The score came down to clock cycles, meaning the number of computational steps the simulated computer needed to complete the task. A solution that used fewer steps performed better.
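
To make the scoring idea concrete, here is a minimal sketch of cycle-based scoring on a toy machine. It is not Anthropic's simulator: the instruction format, the `run` function, and the one-cycle-per-instruction cost are all assumptions made for illustration. Two programs compute the same result, and the one that executes fewer instructions gets the better score.

```python
# Hypothetical toy machine for illustration only; not the simulator from the test.
def run(program, registers):
    """Execute a straight-line program and return (final registers, cycle count)."""
    regs = dict(registers)
    cycles = 0
    for op, dst, a, b in program:
        if op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown op: {op}")
        cycles += 1  # assumption: every instruction costs exactly one cycle
    return regs, cycles


# Baseline: compute 8 * x by adding x repeatedly (7 instructions).
baseline = [("add", "acc", "x", "x")] + [("add", "acc", "acc", "x")] * 6

# Optimized: compute 8 * x by doubling three times (3 instructions).
optimized = [
    ("add", "acc", "x", "x"),
    ("add", "acc", "acc", "acc"),
    ("add", "acc", "acc", "acc"),
]

base_regs, base_cycles = run(baseline, {"x": 5})
opt_regs, opt_cycles = run(optimized, {"x": 5})

print(base_regs["acc"], base_cycles)
print(opt_regs["acc"], opt_cycles)
```

Both programs leave the same value in `acc`, so a scorer that only counts cycles would rank the doubling version higher. The real test presumably involves a far richer machine model, but the principle of "same output, fewer steps" is the same.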

More than 1,000 candidates have completed the original test since early 2024. Hume designed it to reflect the kind of work engineers might actually do at Anthropic, which also meant candidates were allowed to use AI tools during the exercise.

That choice mattered. The goal was not to evaluate candidates in an artificial environment where AI assistants were banned. It was to understand how they worked in a setting closer to the job itself.

The test also appears to have served Anthropic well for a time. Many candidates voluntarily continued beyond the four-hour time limit because they found the challenge engaging, and dozens of engineers hired through this process now work on Anthropic’s infrastructure.

Claude kept catching up to the candidates

The issue did not appear all at once. It emerged as Claude improved.

With Claude 3.7 Sonnet, Anthropic saw that more than half of candidates would have scored better if they had handed the entire assignment to Claude rather than writing the code themselves. That undercut the value of the test as a way to separate strong engineering judgment from simple model use.

By May 2025, Claude Opus 4 was outperforming nearly all human solutions within the time limit. Hume responded by changing the test and reducing the allotted time to two hours.

Then Claude Opus 4.5 created the next problem. Within two hours, the model could match the scores of the best human candidates. Humans still have an edge without a time limit, sometimes by a wide margin, but that advantage becomes harder to capture in a realistic take-home format.

This is the central tension Anthropic is facing: a test needs to be bounded enough to fit into a hiring process, but AI models are increasingly strong inside those same bounded conditions.

Why banning AI was not the answer

One obvious response would be to prohibit AI tools during the assessment. Hume considered that option, but rejected it because it would not match the real job.

On the job, Anthropic engineers use AI assistants. A hiring process that pretends otherwise would test a narrower and less relevant skill set.

That makes the performance engineer test part of a larger hiring challenge for AI companies and technical teams: if candidates will use AI at work, then the assessment should measure how well they use it. But if the model can solve the exercise on its own, the test no longer reveals enough about the candidate.

The question becomes less about whether someone can solve a coding task in isolation, and more about whether the task itself is still capable of exposing human reasoning, taste, and persistence.

The new test moves away from realism

Anthropic’s latest answer was not another small tweak. Hume shifted to a different kind of assessment inspired by programming puzzle games from Zachtronics, which are known for unusual and heavily constrained programming environments.

In those games, players work with minimal commands and limited memory. The restrictions force creative solutions because the environment does not resemble ordinary programming work.

The new Anthropic test uses similar constraints. Claude struggles with these tasks because they barely appear in its training data.

"Realism may be a luxury we no longer have," Hume writes.

That line captures the tradeoff. The original test worked because it looked like real work. The new test works because it creates novel work that neither humans nor AI have already seen.

For hiring, that is a major shift. A realistic test can show how someone might perform day to day, but only if the task is still hard for the tools available to them. When AI systems become strong enough to dominate the exercise, novelty becomes part of the evaluation.

Humans still have room to win

Anthropic has published the original test on GitHub. That move also turns the old assessment into a benchmark of sorts, because people can still try to beat Claude’s best performance.

Hume’s post makes clear that humans are not simply out of the picture. Given unlimited time, the fastest human solution ever submitted still beats Claude’s best performance by a significant margin.

Anthropic is also keeping that path open. Anyone who submits a more efficient solution than Claude can apply directly to Anthropic, while everyone else can apply through the normal process and take the new test.

The deeper lesson is that AI-assisted hiring tests may need to evolve as quickly as the tools they allow. Anthropic’s experience shows the difficulty of designing a fair, useful assessment when the same AI systems used at work can also outperform many candidates on the test itself.