EVMbench, a new benchmark developed by OpenAI and Paradigm, is intended to assess AI agents' abilities to identify, patch, and even exploit flaws in Ethereum Virtual Machine (EVM) smart contracts. The tool arrives at a crucial moment: AI advances threaten to empower both defenders and attackers in blockchain security, and smart contracts protect over $100 billion in cryptocurrency assets.

EVMbench incorporates scenarios from the Tempo blockchain audit, which focuses on high-throughput stablecoin payments, and draws on 120 real-world vulnerabilities from 40 audits, many sourced from Code4rena competitions. The benchmark simulates real cybersecurity workflows by running AI models through three modes. In "Detect" mode, agents examine contract repositories much as human auditors do when claiming bug bounties, and are scored on their recall of ground-truth flaws.
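The recall-based scoring described above can be sketched as follows. This is a minimal illustration, not EVMbench's actual grader; the finding labels and exact-match logic are assumptions for the example.

```python
# Sketch of Detect-mode scoring: recall of ground-truth flaws.
# Matching by string label is a simplifying assumption for illustration.

def detect_recall(ground_truth: set[str], agent_findings: set[str]) -> float:
    """Fraction of known vulnerabilities the agent rediscovered."""
    if not ground_truth:
        return 1.0  # nothing to find, trivially perfect recall
    matched = ground_truth & agent_findings
    return len(matched) / len(ground_truth)

# Example: the audit recorded three flaws; the agent rediscovered two
# and reported one extra finding (false positives do not affect recall).
truth = {"reentrancy-withdraw", "unchecked-transfer", "oracle-staleness"}
reported = {"reentrancy-withdraw", "oracle-staleness", "gas-griefing"}
print(detect_recall(truth, reported))  # 2/3
```

Recall rewards completeness: an agent that stops after one valid finding, a failure mode noted later in the article, scores poorly even though the finding is real.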

"Patch" mode tests their ability to modify code while maintaining functionality and preventing exploits, as confirmed by automated tests. Transaction replays and on-chain checks are used to grade the most difficult "Exploit" mode, which deploys contracts in a sandboxed Anvil environment where agents launch fund-draining attacks. By restricting risky RPC calls and separating tasks from active networks, a custom Rust harness guarantees reproducibility.

Preliminary testing shows frontier models improving rapidly, particularly at exploitation. Six months ago, GPT-5 scored 31.9% in Exploit mode via Codex CLI; OpenAI's GPT-5.3-Codex now scores 72.2%. Detection recall and patch success lag behind, however, because agents frequently stop after finding a single flaw or struggle to craft subtle fixes that don't break existing code. The gap suggests current AI is better suited to goal-driven attacks than to thorough audits.

Benchmark Design and Limitations

EVMbench's developers hardened the benchmark against cheating: they modified exploit scripts and proofs, used AI auditors for quality control, and red-teamed the graders. Still, it lacks mainnet complexity and timing dependencies, uses single-chain configurations, and its contracts have received less scrutiny than top production protocols. Detect mode risks missing issues that human auditors also overlooked, or rewarding false positives, while Exploit mode depends on mocked chain state.

This release highlights growing AI-cyber risks in blockchain. OpenAI emphasizes defensive use, pairing EVMbench with safeguards such as safety training, trusted access, and Aardvark, a security research agent in private beta. Through its Cybersecurity Grant Program, the company has committed $10 million in API credits for critical infrastructure and open-source defenses. By open-sourcing the benchmark's tasks and tooling, it hopes to mobilize developers against exploits as agent capabilities evolve.

EVMbench suggests AI has the potential to revolutionize smart contract security, but only if it is deployed proactively. Because models are increasingly effective at attacks, integrating them into audits is crucial to safeguarding billions of dollars in assets.