AI agents may be more capable than standard benchmark scores suggest. A study by the UK's AI Security Institute (AISI) found that common evaluations can understate what frontier models can do when those tests impose fixed limits on compute.
The core issue is simple: an agent's result is not just a property of the model. It also depends on how much test-time compute the agent is allowed to spend while working through a task.
Why fixed benchmark budgets can miss capability
AISI tested frontier models on seven benchmarks under a range of compute budgets. It found that a capped budget collapses a still-moving performance curve into a single score that looks more settled than it really is.
When an agent is still improving as it spends more tokens, stopping the test early does not show the agent's full capability. It shows what the agent managed to do before the budget ran out.
That matters because standard benchmarks are often treated as a clean measure of model ability. AISI's findings suggest that for AI agents, especially those that can work through complex problems step by step, the score can depend heavily on the budget chosen for the test.
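The effect of a cap can be sketched in a few lines of Python. The example assumes that, for each task, we know how many tokens the agent had used when it succeeded (or `None` if it never did). The function name `success_rate_at_budget` and the five task results are hypothetical and are not taken from the AISI study.

```python
from typing import Optional, Sequence


def success_rate_at_budget(
    tokens_to_solve: Sequence[Optional[int]], budget: int
) -> float:
    """Fraction of tasks solved within `budget` tokens.

    tokens_to_solve[i] is the token count at which task i was solved,
    or None if the agent never solved it at any budget that was tried.
    """
    if not tokens_to_solve:
        raise ValueError("need at least one task")
    solved = sum(1 for t in tokens_to_solve if t is not None and t <= budget)
    return solved / len(tokens_to_solve)


# Hypothetical results for five tasks (illustrative only, not AISI data).
runs = [200_000, 900_000, 4_000_000, 30_000_000, None]

# The same runs produce a different "score" at each budget cap.
budgets = (1_000_000, 10_000_000, 50_000_000)
curve = {b: success_rate_at_budget(runs, b) for b in budgets}
```

With these made-up runs, the score is 0.4 at a 1 million token cap, 0.6 at 10 million, and 0.8 at 50 million. Nothing about the agent changed between those three numbers; only the cutoff did.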
More tokens changed results across key tasks
The study found that giving models more test-time compute increased success rates by up to 25 percent. The largest practical effects appeared in areas where agents can check their own progress, including cybersecurity and software development.
In cybersecurity, about 8 percent of tasks were solved only after the budget rose above 10 million tokens. Some tasks required 50 million tokens. The newest models reached even higher scores when budgets went above 100 million tokens.
Software engineering showed a similar pattern. On TerminalBench 2.0 and SWE-Bench Pro, success rates rose about 25 percent when the token budget increased from 1 million to 10 million.
Math and academic tasks also improved. On Humanity's Last Exam, performance rose around 22 percent up to a budget of 5 million tokens.
But the effect was not universal. On HealthBench, a medical task benchmark, all models reached their plateau within the standard budget. According to AISI, extra compute is most useful when agents can verify their work, such as by running code or testing an exploit. Where feedback is unavailable or delayed, more compute has much less impact.
Harder human tasks demand far more tokens
AISI also found a link between how long a human expert would need for a task and how many tokens an AI agent consumes. The study looked at 211 software engineering tasks from METR and 78 cyber tasks from AISI.
Across those tasks, the relationship followed a power law. A one-minute task costs the agent thousands of tokens. A one-hour task costs millions. A one-week task costs billions.
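A power law of the form tokens = a * minutes^b is a straight line on log-log axes, so it can be estimated with ordinary least squares on the logarithms. The sketch below shows that technique on synthetic data. The function name `fit_power_law`, the coefficient 5000, and the exponent 1.5 are made up for illustration; the article does not report AISI's fitted values or its fitting method.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on (log x, log y). Returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two or more paired points")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((u - mx) ** 2 for u in lx)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    # Slope of the log-log line is the power-law exponent b.
    b = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sxx
    # Intercept of the log-log line is log(a).
    a = math.exp(my - b * mx)
    return a, b


# Synthetic points generated from tokens = 5000 * minutes**1.5 (made up).
minutes = [1, 10, 60, 600]
tokens = [5000 * m ** 1.5 for m in minutes]
a, b = fit_power_law(minutes, tokens)
```

Because the synthetic points lie exactly on a power law, the fit recovers the exponent and coefficient they were generated from. Real task data would scatter around the line, and the fitted exponent would summarize how quickly token cost grows with human task time.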
This helps explain why a fixed evaluation budget can distort results. Longer and harder tasks are more likely to be cut short. A failed benchmark attempt may show that the model ran out of budget, not that the model had no route to a solution.
AISI highlighted the cyber task "The Last Ones", which takes a human expert about 20 hours. No tested model solved it with fewer than 30 million tokens.
Newer models gain more from extra compute
The study also found that newer models benefit more from larger budgets than older models. AISI described improvement along three dimensions: reach, reliability, and efficiency.
- Reach: harder tasks become solvable.
- Reliability: the same task is solved more often.
- Efficiency: the same task takes fewer tokens.
For a current frontier model, the time horizon grew from about 40 minutes at 2.5 million tokens to roughly four hours at 50 million tokens. Across the entire frontier, the horizon moved from about two hours to 14 hours when the budget rose from 2.5 to 50 million tokens.
That changes how fast progress appears to be happening. AISI had previously estimated that frontier models' time horizon on cyber tasks doubled roughly every 4.7 months when measured at a fixed budget of 2.5 million tokens. At 50 million tokens, the trend was about 60 percent steeper, with doubling every 40 to 50 days instead of every 67 to 91 days.
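A difference in doubling time compounds quickly. Under steady exponential growth with doubling time d, a quantity is multiplied by 2^(t/d) after t days. The short sketch below applies that formula. The helper name `growth_factor` and the 180-day window are assumptions chosen for illustration, and 45 and 90 days are sample values picked from inside the two reported ranges, not figures from the study.

```python
def growth_factor(doubling_days: float, elapsed_days: float) -> float:
    """Multiplicative growth after `elapsed_days`, given a doubling time in days."""
    if doubling_days <= 0:
        raise ValueError("doubling_days must be positive")
    return 2.0 ** (elapsed_days / doubling_days)


# Sample doubling times inside the reported ranges (illustrative choices).
fast = growth_factor(doubling_days=45, elapsed_days=180)  # four doublings
slow = growth_factor(doubling_days=90, elapsed_days=180)  # two doublings
```

Over the same 180 days, the 45-day trend multiplies the time horizon by 16, while the 90-day trend multiplies it by 4.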
Still, the study did not show smooth improvement everywhere. On about 10 to 30 percent of tasks, newer models performed worse than their predecessors.
What better AI agent testing needs to measure
AISI's broader point is that capability should be measured as a curve over compute, not as a single fixed score. The study puts it directly:
"If we keep treating capability as a fixed score rather than a curve over compute, we will keep being surprised by what these systems can do when more is spent on them."
This has consequences for decisions about deployment, economic value, and risk. If a test budget is too small, the result may make a model look less useful or less risky than it could become when more compute is available.
Falling costs per token could make larger test-time budgets easier to use over time. That would make high-budget behavior more relevant, not less.
AISI now tests frontier models at several budgets. Its approach, called "minimum informative budgets", is meant to show whether a model's reach has stopped growing with additional compute. The team is also working on ways to predict high-budget performance from cheaper test runs.
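The article does not describe how AISI computes a minimum informative budget. As a rough illustration of the underlying idea, the sketch below looks for the smallest tested budget whose score is already within a tolerance of the best score seen, and reports nothing if the curve is still rising at the largest budget. The function name `plateau_budget`, the tolerance, and the sample curves are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def plateau_budget(
    curve: Sequence[Tuple[int, float]], tolerance: float = 0.01
) -> Optional[int]:
    """Smallest tested budget whose success rate is within `tolerance`
    of the best rate seen at any budget.

    `curve` holds (budget_in_tokens, success_rate) pairs. The largest
    budget is never returned, because a plateau needs at least one larger
    budget to confirm it. Returns None if no plateau is confirmed.
    """
    points = sorted(curve)
    if len(points) < 2:
        return None
    best = max(rate for _, rate in points)
    for budget, rate in points[:-1]:
        if best - rate <= tolerance:
            return budget
    return None


# Hypothetical curves (illustrative only, not AISI data).
flat = [(1_000_000, 0.40), (5_000_000, 0.62), (10_000_000, 0.70), (50_000_000, 0.70)]
rising = [(1_000_000, 0.40), (10_000_000, 0.60), (50_000_000, 0.80)]

flat_result = plateau_budget(flat)
rising_result = plateau_budget(rising)
```

For the flat curve the sketch returns 10,000,000, the point after which extra tokens add nothing. For the rising curve it returns `None`, which signals that the test has not yet spent enough to find where the curve stops.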
The message is not that every task improves endlessly with more tokens. It is that AI agent benchmarks need to show where the curve stops. Without that, the score may describe the test setup as much as the system being tested.