In a post published on February 5, 2026, Carlini wrote: “We tasked Opus 4.6 using agent teams to build a C Compiler, and then (mostly) walked away. Here’s what it taught us about the future of autonomous software development.” “Agent teams” is his name for a setup in which several instances of Claude work in parallel on the same codebase without constant human input. The result was a 100,000-line compiler that can build Linux 6.9 for the x86, ARM, and RISC-V architectures. It can also compile other large projects, including QEMU, FFmpeg, SQLite, Postgres, Redis, and even Doom.
How the agent teams worked
Carlini set up a simple system built from Docker containers and a shared Git repo. Each agent got its own workspace, claimed a task by creating a lock file, fixed a bug or added a feature, then pushed its changes and cleared the lock. When merges produced conflicts, the agents resolved them on their own. No orchestrating agent assigned high-level work; each agent simply picked the “next most obvious” problem. Some agents specialised: one handled duplicate code, another focused on performance, a few critiqued the design as a Rust expert would, and others kept the docs up to date.
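The lock-file claiming step can be sketched in a few lines of Python. This is a minimal, hypothetical sketch: it uses atomic file creation in a local directory (`os.open` with `O_CREAT | O_EXCL`) as a stand-in for the shared Git repo, where presumably a conflicting push would tell an agent that someone else got there first. The helper names (`try_claim`, `release`, `next_task`) and the task names are illustrative, not taken from Carlini's code.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path
from typing import List, Optional


def try_claim(lock_dir: Path, task: str, agent: str) -> bool:
    """Create lock_dir/<task>.lock atomically; return False if it already exists."""
    lock_path = lock_dir / f"{task}.lock"
    try:
        # O_CREAT | O_EXCL makes creation fail if the file already exists,
        # so at most one agent can win the claim.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(agent)  # record which agent holds the lock
    return True


def release(lock_dir: Path, task: str) -> None:
    """Clear the lock after the agent has pushed its changes."""
    (lock_dir / f"{task}.lock").unlink()


def next_task(lock_dir: Path, backlog: List[str], agent: str) -> Optional[str]:
    """Claim the first unclaimed task in priority order, or return None."""
    for task in backlog:
        if try_claim(lock_dir, task, agent):
            return task
    return None


with tempfile.TemporaryDirectory() as d:
    locks = Path(d)
    backlog = ["fix-switch-codegen", "dedupe-parser", "update-docs"]
    a = next_task(locks, backlog, "agent-a")  # "fix-switch-codegen"
    b = next_task(locks, backlog, "agent-b")  # "dedupe-parser"
    release(locks, a)                         # agent-a pushed and cleared its lock
    c = next_task(locks, backlog, "agent-c")  # "fix-switch-codegen" is free again
    print(a, b, c)
```

Because the filesystem (or, in the real setup, the repo) decides who wins a claim, no coordinating agent is needed for two agents to avoid doing the same work.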
He stressed that good tests were key. “Write extremely high-quality tests,” Carlini said, because weak tests let the model pursue the wrong goals. The project relied on existing compiler test suites, built verifiers around open-source software packages, and added continuous integration so that new changes could not break existing functionality. When the Linux kernel proved too big for a single task, GCC served as a known-good reference (an “oracle”) for splitting the work, so that different agents could fix bugs in different files at the same time.
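One way to picture the GCC-oracle idea: compile some kernel files with the new compiler and the rest with GCC, and if the resulting build breaks, narrow down which new-compiler files are to blame. The sketch below uses a bisection-style search (a simplified form of delta debugging), which may differ from the procedure actually used in the experiment. `build_ok` is a hypothetical oracle, simulated here with a fixed set of "bad" files; the file list is illustrative, and the search assumes each miscompiled file breaks the build on its own.

```python
from __future__ import annotations

from typing import Callable, List, Sequence


def find_culprits(
    files: Sequence[str], build_ok: Callable[[Sequence[str]], bool]
) -> List[str]:
    """Return the files that break the build when compiled by the new compiler.

    build_ok(subset) is assumed to compile `subset` with the new compiler and
    every other file with GCC, returning True if the resulting build works.
    Assumes any subset containing a miscompiled file fails (no interactions
    between files); bugs that only appear in combination would be missed.
    """
    if not files or build_ok(files):
        return []  # nothing in this subset breaks the build
    if len(files) == 1:
        return list(files)  # narrowed down to a single culprit
    mid = len(files) // 2
    return find_culprits(files[:mid], build_ok) + find_culprits(files[mid:], build_ok)


# Hypothetical file list and a simulated oracle standing in for real builds.
KERNEL_FILES = [
    "arch/x86/boot.c",
    "fs/ext4/inode.c",
    "kernel/fork.c",
    "kernel/sched/core.c",
    "mm/slab.c",
    "net/core/dev.c",
]
BAD = {"fs/ext4/inode.c", "kernel/sched/core.c"}


def fake_build_ok(subset: Sequence[str]) -> bool:
    # The build "works" only if no bad file was compiled by the new compiler.
    return not (set(subset) & BAD)


culprits = find_culprits(KERNEL_FILES, fake_build_ok)
# Hand each culprit to a different (hypothetical) agent.
assignments = dict(zip(["agent-1", "agent-2"], culprits))
print(assignments)
# {'agent-1': 'fs/ext4/inode.c', 'agent-2': 'kernel/sched/core.c'}
```

Each culprit is an independent bug, which is what lets several agents work on the kernel at once instead of all chasing the same first failure.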
What it revealed about limits
The compiler works well in many ways: it reaches a 99% pass rate on most test suites and compiles a bootable kernel for three architectures. But it has gaps. It cannot generate the 16-bit x86 code needed for real-mode booting, so it calls GCC for that step instead. Its own assembler and linker are incomplete and still buggy. And the code it generates is less efficient than GCC's output even with all optimizations disabled. The compiler's own source code is decent but not expert-level. Carlini pushed hard on some of these issues but hit walls. “The resulting compiler has nearly reached the limits of Opus’s abilities,” he noted.
The project was designed to probe what today’s models can only barely pull off. Earlier Opus models struggled more, but Opus 4.6 crossed some important thresholds. Carlini admitted his surprise: “I did not expect this to be anywhere near possible so early in 2026.” He is excited about what this means for large projects, but also uneasy. Without humans checking every piece, mistakes can slip through even when the tests pass. Drawing on his background in penetration testing, he worries about deploying code that no one has personally verified.
The source code is publicly available for anyone to download and try. Carlini plans to let Claude keep improving it. The experiment shows how quickly autonomous coding is advancing, but it also highlights why careful human oversight still matters.