What 16 Claude AI agents proved by building a C compiler

Sixteen Claude Opus 4.6 agents produced a Rust-based C compiler that could build a bootable Linux 6.9 kernel. The project also showed the limits of autonomous coding: it required careful human scaffolding, strong tests, and constant verification.

Anthropic researcher Nicholas Carlini gave 16 Claude Opus 4.6 agents a demanding software task: build a C compiler from scratch. After two weeks, nearly 2,000 Claude Code sessions, and about $20,000 in API fees, the agents produced a 100,000-line Rust-based compiler with some striking results and some equally important limits.

The experiment is a useful signal for the future of AI agents in software development. It shows that agent teams can coordinate on a large technical project, but it also shows why human judgment, test design, and verification remain central.

What the AI agents built

The project used 16 instances of Claude Opus 4.6 working on a shared codebase. Each instance ran in its own Docker container, cloned the same Git repository, claimed tasks through lock files, and pushed completed code upstream.
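The lock-file idea can be sketched in a few lines of Python. This is only an illustration, not the project's actual mechanism: the source does not describe the lock format, so the shared directory, the `.lock` suffix, and the names `try_claim` and `release` are all assumptions made for the example.

```python
import os


def try_claim(lock_dir: str, task_id: str, agent_id: str) -> bool:
    """Try to claim a task by creating its lock file; return True on success."""
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"{task_id}.lock")  # hypothetical naming scheme
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # caller can create a given lock file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(agent_id)  # record the owner for debugging
    return True


def release(lock_dir: str, task_id: str) -> None:
    """Release a task by deleting its lock file."""
    os.remove(os.path.join(lock_dir, f"{task_id}.lock"))
```

An agent would call `try_claim` before starting a task and move on to another task if it returns `False`. Agents that coordinate through a Git repository rather than a shared directory would need a different way to make the claim exclusive.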

There was no orchestration agent assigning work. Each Claude instance chose what looked like the most obvious problem, worked on it, and handled merge conflicts when they appeared.

The result was a C compiler written in Rust. According to the source, it could build a bootable Linux 6.9 kernel on x86, ARM, and RISC-V architectures. Anthropic also released the compiler on GitHub.

The compiler could handle several major open source projects, including PostgreSQL, SQLite, Redis, FFmpeg, and QEMU. It reached a 99 percent pass rate on the GCC torture test suite and compiled and ran Doom, which Carlini described as “the developer’s ultimate litmus test.”

Why a C compiler was a good target

A C compiler is a difficult project, but it is also unusually well suited to semi-autonomous AI coding. The C language has a long-established specification, existing test suites, and reference compilers that can be used for comparison.

That matters because AI agents need clear feedback. In this case, the agents could repeatedly test whether their compiler behaved correctly. They were not being asked to define a vague product, negotiate unclear requirements, or guess what users might want.

Most software projects are less tidy. The hard question is often not whether code passes a known test. It is what the test should be, what behavior matters, and which tradeoffs are acceptable. The compiler experiment worked inside a domain where those questions were unusually concrete.

  • Clear target: build a working C compiler.
  • Existing references: GCC could be used to compare behavior.
  • Strong tests: the GCC torture test suite gave the agents direct feedback.
  • Visible milestones: compiling projects and booting Linux gave practical proof points.
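
Comparing against a reference compiler is a form of differential testing. A minimal sketch in Python, assuming the same C file has already been compiled once by the reference compiler and once by the compiler under test, and that each command below runs one of the resulting programs. The names `run` and `same_behavior` are made up for this example.

```python
import subprocess
from typing import List, NamedTuple


class RunResult(NamedTuple):
    returncode: int
    stdout: str


def run(cmd: List[str], timeout: float = 10.0) -> RunResult:
    """Run a command and capture its exit code and standard output."""
    done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return RunResult(done.returncode, done.stdout)


def same_behavior(reference_cmd: List[str], candidate_cmd: List[str]) -> bool:
    """Return True if both programs exit with the same code and output."""
    return run(reference_cmd) == run(candidate_cmd)
```

A mismatch does not say which compiler is wrong, but with a mature reference such as GCC it is a strong hint that the candidate miscompiled the file.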

The limits were part of the result

Carlini was direct about the compiler’s weaknesses. It lacks a 16-bit x86 backend needed to boot Linux from real mode, so it calls out to GCC for that step. Its own assembler and linker remain buggy.

Even with its own optimizations enabled, the compiler produces less efficient code than GCC does with all optimizations disabled. The Rust code works, but the source says it does not match what an expert Rust programmer would produce.

Carlini wrote, “The resulting compiler has nearly reached the limits of Opus’s abilities.” He also wrote, “I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.”

That failure mode is important. As the codebase approached 100,000 lines, changes became harder to make without damaging existing behavior. The source connects this to a broader issue with AI coding agents: they can lose coherence over time.

In other words, the experiment did not simply prove that AI agents can write a large codebase. It also suggested a practical ceiling for current models when the system becomes large enough that no contributor fully understands it.

The hidden human work

Anthropic described the project as a “clean-room implementation” because the agents had no internet access during development. The source notes that this framing is contested because the model had been trained on large amounts of publicly available source code, likely including GCC, Clang, and smaller C compilers.

The human scaffolding was also significant. Carlini built the environment that made the agents productive: test harnesses, continuous integration pipelines, and feedback systems designed around language model failure modes.

One issue was context pollution. Verbose test output could fill the model’s context window and cause it to lose track of the task. Carlini responded by creating test runners that printed only a few summary lines while logging details separately.
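
A runner of that kind might look roughly like the following Python sketch. It is not Carlini's harness: the function name `run_tests`, the summary wording, and the log layout are assumptions made for the example.

```python
import traceback
from typing import Callable, Dict


def run_tests(tests: Dict[str, Callable[[], None]], log_path: str,
              max_shown: int = 3) -> str:
    """Run every test, write full details to log_path, return a short summary."""
    passed = 0
    failed = []
    with open(log_path, "w", encoding="utf-8") as log:
        for name, test in tests.items():
            try:
                test()
            except Exception:
                failed.append(name)
                # The full traceback goes to the log file only.
                log.write(f"FAIL {name}\n{traceback.format_exc()}\n")
            else:
                passed += 1
                log.write(f"PASS {name}\n")
    # The summary stays at two or three lines regardless of suite size.
    lines = [f"{passed}/{len(tests)} passed, {len(failed)} failed"]
    if failed:
        lines.append("first failures: " + ", ".join(failed[:max_shown]))
    lines.append(f"full log: {log_path}")
    return "\n".join(lines)
```

Only the returned summary would be shown to the model, which can open the log file when it needs the details of a specific failure.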

Another issue was time. Claude has no sense of time, according to the source, and could spend hours running tests without making progress. Carlini created a fast mode that ran only a 1 to 10 percent sample of the test cases.
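
The sampling step can be sketched in Python as follows. The function name `sample_tests` and the seed parameter are assumptions. The source only says that a small percentage of test cases was run, not how they were picked.

```python
import random
from typing import List, Sequence


def sample_tests(test_names: Sequence[str], fraction: float, seed: int) -> List[str]:
    """Pick a reproducible random subset of tests covering `fraction` of the suite."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    names = list(test_names)
    if not names:
        return []
    # Always run at least one test, even for tiny suites.
    count = max(1, round(len(names) * fraction))
    rng = random.Random(seed)  # a fixed seed gives the same subset every time
    return sorted(rng.sample(names, count))
```

Fixing the seed is a design choice in this sketch: it lets a rerun confirm whether a change fixed the failures that were just observed, while a different seed covers a different slice of the suite.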

When all 16 agents became stuck on the same Linux kernel bug, Carlini used GCC as a reference oracle. A randomly chosen majority of the kernel files was compiled with GCC and only the remaining subset with Claude’s compiler, so a broken build pointed to that subset and agents could work on different bugs in different files.
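
The file split can be illustrated with a short Python sketch. The labels `"gcc"` and `"under_test"`, the function name `assign_compilers`, and the per-agent seed are assumptions for the example, not details from the source.

```python
import random
from typing import Dict, Sequence


def assign_compilers(files: Sequence[str], test_fraction: float,
                     seed: int) -> Dict[str, str]:
    """Mark a random subset of files for the compiler under test, the rest for GCC."""
    if not 0 <= test_fraction <= 1:
        raise ValueError("test_fraction must be in [0, 1]")
    names = list(files)
    count = round(len(names) * test_fraction)
    rng = random.Random(seed)  # e.g. a different seed per agent
    chosen = set(rng.sample(names, count))
    return {name: ("under_test" if name in chosen else "gcc") for name in names}
```

If the kernel built this way fails to boot, the fault most likely lies in one of the files marked `"under_test"`, and agents with different seeds end up investigating different subsets.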

“Claude will work autonomously to solve whatever problem I give it,” Carlini wrote. “So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem.”

What this means for software teams

The strongest lesson is not that AI agents can replace software teams. It is that agentic software development depends heavily on the quality of the surrounding system.

When the task is specific, the tests are strong, and the feedback loop is carefully managed, AI agents can produce substantial code. When the verifier is weak, the agents may optimize for the wrong outcome. When the project grows too complex, fixes can break earlier behavior.

The $20,000 cost also needs context. That figure covered API token costs only. It did not include the billions spent training the model, Carlini’s work building the scaffolding, or the decades of compiler engineering behind the tests and reference implementations.

Carlini himself sounded both impressed and uneasy. He wrote, “Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026.” He also warned that “the thought of programmers deploying software they’ve never personally verified is a real concern.”

That concern is the practical takeaway. AI agents can now do more than small coding chores. But the more code they produce, the more important it becomes for humans to verify what was built, understand the limits, and design systems that keep the agents pointed at the right problem.