MirrorCode is a new coding benchmark from Epoch AI and METR that asks a harder question than whether an AI model can patch a bug or solve a short programming challenge. It tests whether models can rebuild complete programs from scratch, without access to the original source code, and match the original behavior closely enough to pass hidden end-to-end tests.
The results show both real progress and clear limits. Claude Opus 4.7 leads the benchmark, while every tested model still fails on the most complex tasks.
What MirrorCode Measures
MirrorCode is built around 25 target programs. The programs span Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression.
The task is not simply to write code that looks plausible. Each AI-generated solution must reproduce the output of the original program. That includes tests the model does not see while it is developing its answer.
This makes the benchmark a test of sustained software reconstruction. A model has to infer how a program should behave, implement that behavior, and keep working across a larger codebase than typical coding tests require.
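The idea of matching the original program's output on unseen tests can be illustrated with a small differential-testing sketch. This is not MirrorCode's actual scaffold: the function names, the two stand-in programs, and the test inputs below are all hypothetical, chosen only to show how a candidate can be scored against a reference by comparing outputs.

```python
import subprocess
import sys


def run(cmd, stdin_text):
    """Run a command with the given stdin and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


def pass_rate(reference_cmd, candidate_cmd, test_inputs):
    """Fraction of inputs on which the candidate matches the reference exactly."""
    passed = sum(
        run(reference_cmd, t) == run(candidate_cmd, t) for t in test_inputs
    )
    return passed / len(test_inputs)


# Hypothetical stand-ins (not real MirrorCode targets):
# the "reference" upper-cases stdin; the "candidate" has a bug when
# the input starts with a digit.
reference = [
    sys.executable, "-c",
    "import sys; print(sys.stdin.read().upper(), end='')",
]
candidate = [
    sys.executable, "-c",
    "import sys; s = sys.stdin.read(); "
    "print(s if s[:1].isdigit() else s.upper(), end='')",
]

# Inputs the candidate's author would not have seen while developing.
hidden_inputs = ["abc\n", "Hello\n", "1a\n", "xyz\n"]

rate = pass_rate(reference, candidate, hidden_inputs)
print(f"pass rate: {rate:.0%}")
```

In this toy setup the candidate matches the reference on three of the four inputs, so it passes most tests without reproducing the program's behavior in full.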
MirrorCode also differs from many software engineering benchmarks in the amount of compute it allows. The source describes existing benchmarks as often limiting task costs to $1 to $10, even when a human might need weeks for a comparable job. MirrorCode includes much larger runs.
The 19-Day Run Shows the Scale
One of the largest MirrorCode tasks cost $2,600 for a single run, according to Epoch AI. The AI worked continuously for 19 days with no human involvement.
That detail matters because it changes the frame for evaluating AI coding systems. A short benchmark can show whether a model can solve compact tasks. A long run tests whether the model can keep making progress when the work stretches across many steps, files, and decisions.
The benchmark therefore sits closer to the kind of software work where persistence matters. A program has to be rebuilt as a working whole, not just answered as a prompt.
But the result also shows that more time and money do not automatically solve the hardest cases. The largest tasks in MirrorCode remain unsolved by all tested models.
Claude Opus 4.7 Leads the Field
Claude Opus 4.7 is the top performer reported in the source, with a 56 percent solve rate. GPT-5.5 follows at 44 percent, and Gemini 3.1 Pro Preview reaches 32 percent.
The strongest example is Claude Opus 4.7 reimplementing gotree, a bioinformatics toolkit. The toolkit has roughly 16,000 lines of Go code and over 40 commands.
Epoch AI says a human engineer working without AI help would need 2 to 17 weeks for that same job. Claude Opus 4.7 finished in 14 hours for $251.
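As a rough back-of-the-envelope check, the reported figures can be converted into a speed-up and an hourly cost. The 40-hour work week is an assumption introduced here, not a number from Epoch AI, and the result shifts considerably if it changes.

```python
# Figures reported in the article for the gotree reimplementation.
AI_HOURS = 14
AI_COST_USD = 251
HUMAN_WEEKS_LOW, HUMAN_WEEKS_HIGH = 2, 17

# Assumption (not from the source): a human engineer works 40 hours per week.
HOURS_PER_WORK_WEEK = 40

human_hours_low = HUMAN_WEEKS_LOW * HOURS_PER_WORK_WEEK    # 80 hours
human_hours_high = HUMAN_WEEKS_HIGH * HOURS_PER_WORK_WEEK  # 680 hours

# Ratio of assumed human working hours to the AI's reported run time.
speedup_low = human_hours_low / AI_HOURS
speedup_high = human_hours_high / AI_HOURS

# Average spend per hour of the AI run.
cost_per_hour = AI_COST_USD / AI_HOURS

print(f"speed-up in working hours: {speedup_low:.1f}x to {speedup_high:.1f}x")
print(f"AI cost per hour of run time: ${cost_per_hour:.2f}")
```

Under that assumption, the run took roughly 6 to 49 times fewer hours than the human estimate, at an average of about $18 per hour of run time.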
That does not mean the benchmark is solved. Even when models fail to fully reimplement a program, they typically pass 90 percent or more of the tests. In practical terms, that is impressive but still incomplete. Passing most tests is not the same as reproducing a whole program correctly.
Where the Models Still Break
MirrorCode divides tasks into small, medium, and large categories. Small programs such as uuid or parseqsv are reliably reimplemented by all tested models.
The largest tasks are different. They beat every model tested, which shows that current AI coding systems still struggle when the target program becomes sufficiently complex.
The researchers are seeing rapid gains, however. Epoch AI says leading models from a year ago would have scored only about 30 percent and would have been limited to simpler programs such as a calendar utility.
Cost is not moving in a single direction. GPT-5.5 costs three times as much as GPT-5 for the same tasks, while Claude Opus 4.7 costs about a third as much as Claude Opus 4.1.
That mixed cost picture is important for anyone comparing models on long software tasks. A higher score, a lower bill, and a faster run do not always arrive together.
Open-Source Targets Come With a Caveat
Epoch AI has open-sourced the scaffold and 22 of the 25 target programs. The released set covers 132 task instances across six programming languages. Three programs are kept private for testing.
There is one major caveat. MirrorCode uses open-source programs as targets, so the models may have encountered the original code during training.
According to the researchers, "the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance."
That caveat does not erase the benchmark results, but it affects how they should be read. MirrorCode is evidence that models can perform long, demanding programming work under these conditions. It is not proof that they always reasoned out every implementation from first principles.
The clearer takeaway is narrower and more useful: AI coding systems are becoming capable of rebuilding sizable programs, sometimes quickly and at notable cost, but the most complex reconstruction tasks still expose hard limits.