EVMbench puts AI smart contract security agents to the test

OpenAI and Paradigm built EVMbench to measure how AI agents handle Ethereum smart contract vulnerabilities. The benchmark shows strong exploitation ability, but finding vulnerabilities in large codebases remains the hardest part.

AI agents are moving deeper into software security work, and Ethereum smart contracts are now a clear test case. EVMbench, built by OpenAI and crypto investment firm Paradigm, measures whether AI agents can find, fix, and exploit vulnerabilities in smart contracts.

The results point to a mixed future for smart contract security. AI agents can already exploit many flaws once those flaws have been located, but the benchmark also shows that discovering the right weak point inside a large codebase is still the main bottleneck.

What EVMbench Measures

EVMbench focuses on Ethereum smart contracts and the kinds of security issues that appear in real-world audits. The dataset covers 120 vulnerabilities drawn from 40 real-world security audits.

That matters because the benchmark is not only asking whether a model can answer a security question in isolation. It is testing agent behavior across three practical tasks:

  • Finding vulnerabilities in smart contract code.
  • Fixing those vulnerabilities.
  • Exploiting vulnerabilities through an attack.

The most realistic test setup goes further than a static code challenge. In that setup, AI agents interact with a local blockchain and have to carry out attacks entirely on their own.

This design makes EVMbench relevant to both defenders and attackers. A system that can autonomously exploit a smart contract vulnerability could help security teams validate risks before deployment. The same capability could also become dangerous if used against contracts that are already live.

The Strongest Results Came From Exploitation

The top-performing model in the benchmark was GPT-5.3-Codex. It successfully exploited 72 percent of the vulnerabilities and fixed 41.5 percent.

For detection, Claude Opus 4.6 came out ahead at 45.6 percent. That split is important: the best model for exploiting vulnerabilities was not reported as the best model for detecting them.

In plain terms, the benchmark suggests that AI agents may be better at acting on a vulnerability than at discovering it from scratch. Once the problem is located, the agent has a narrower task. It can focus on proving the weakness or producing a repair instead of searching through a broader codebase.

This is why EVMbench should not be read only as a scoreboard of model performance. It also shows where the current limits are. Exploiting and fixing are difficult, but the hardest step is often getting the agent to identify the relevant issue in the first place.

Finding the Flaw Is the Hard Part

The researchers say the biggest challenge for AI agents is not exploiting or fixing vulnerabilities. It is finding them in large codebases.

The benchmark results make that point sharply. When agents were given hints about where a vulnerability was located, exploit success rates jumped from 63 to 96 percent. Fix rates climbed from 39 to 94 percent.
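To put those jumps in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration built on the four percentages quoted above. The function name `hint_uplift` and the "share of failures removed" measure are assumptions made for this example, not metrics reported by the benchmark authors.

```python
# Success rates (percent) quoted in the article: (without hints, with hints).
RATES = {
    "exploit": (63, 96),
    "fix": (39, 94),
}


def hint_uplift(without_hint, with_hint):
    """Return (percentage-point gain, relative gain, share of failures removed).

    Hypothetical helper for illustration; not part of EVMbench itself.
    """
    point_gain = with_hint - without_hint
    relative_gain = with_hint / without_hint
    # Fraction of the cases that failed without hints but succeed with them.
    failures_removed = point_gain / (100 - without_hint)
    return point_gain, relative_gain, failures_removed


for task, (before, after) in RATES.items():
    points, ratio, removed = hint_uplift(before, after)
    print(f"{task}: +{points} points, {ratio:.2f}x, {removed:.0%} of failures removed")

# Prints:
# exploit: +33 points, 1.52x, 89% of failures removed
# fix: +55 points, 2.41x, 90% of failures removed
```

Read this way, hints eliminate roughly nine in ten of the failures in both tasks, which is consistent with the idea that locating the flaw, not acting on it, accounts for most of the lost performance.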

Those jumps show how much performance depends on narrowing the search space. A hint changes the task from open-ended security discovery to targeted execution. The agent no longer has to decide where the meaningful risk is hiding; it can concentrate on the code region that matters.

For security teams, that distinction is central. An AI agent that performs well with hints may be useful in a workflow where humans, tools, or audits already point to suspicious areas. A fully autonomous system has a harder job because it must decide what to inspect before it can exploit or repair anything.

The same distinction also matters for risk. If a bad actor can guide an agent toward vulnerable code, the benchmark suggests exploitation becomes much easier. If the agent must search without guidance, its success is more limited.

Why This Matters for Smart Contract Security

The stakes are high because smart contracts hold significant value. The authors point to over $100 billion locked in smart contracts.

That makes improvements in smart contract security valuable. Better AI agents could help uncover issues before they are abused, support audit work, and speed up vulnerability repair. In that scenario, EVMbench becomes a way to measure progress toward stronger defensive tools.

But the same findings also carry a warning. If AI agents can exploit a large share of known vulnerabilities in a realistic local blockchain setup, then those capabilities are not only useful to defenders. They may also increase risk if they end up in the wrong hands.

The core lesson is not that AI has solved smart contract security. EVMbench shows a more specific picture: current agents can be powerful once the target is clear, but broad vulnerability discovery remains difficult. That gap is where much of the next security challenge sits.

The Near-Term Takeaway

EVMbench gives the field a clearer way to evaluate AI agents on Ethereum smart contract security. It measures detection, repair, and exploitation against vulnerabilities taken from real-world audits, and it tests agents in a setup where they must interact with a local blockchain.

The results show meaningful capability, especially in exploitation. They also show that detection in large codebases is still the limiting factor.

For builders and security teams, the practical message is straightforward. AI agents may become useful parts of smart contract security workflows, but their strongest performance appears when the problem is already narrowed down. The more open-ended the task, the harder it remains.