Building a C Compiler with a Team of Parallel Claudes

Published: Feb 05, 2026

Author: Nicholas Carlini, a researcher on Anthropic's Safeguards team.

Overview

The article describes an experiment with "agent teams" — multiple Claude instances working in parallel on a shared codebase without active human intervention. The author tasked 16 agents with writing a Rust-based C compiler from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.

The compiler is available at github.com/anthropics/claudes-c-compiler.

Enabling Long-Running Claudes

Existing agent scaffolds like Claude Code require an operator to be online. To elicit sustained, autonomous progress, the author built a harness that puts Claude in a simple loop — when it finishes one task, it immediately picks up the next. The bash script uses claude --dangerously-skip-permissions in a while true loop, logging output per commit. The agent prompt instructs Claude to break problems into small pieces, track progress, and keep going. As the author notes, "The loop runs forever—although in one instance, I did see Claude pkill -9 bash on accident, thus killing itself."
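The loop described above can be sketched in Python rather than bash. This is a minimal illustration under stated assumptions, not the author's script: the function name `run_agent_loop`, the `max_sessions` escape hatch, and the per-session log file names are hypothetical (the summary only says output is logged per commit).

```python
import subprocess
from pathlib import Path


def run_agent_loop(command, log_dir, max_sessions=None):
    """Run `command` repeatedly, writing each session's output to its own log.

    max_sessions=None loops forever, as the harness in the article does; a
    finite value is an addition so the sketch can be tried out safely.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    session = 0
    while max_sessions is None or session < max_sessions:
        # Assumption: logs are numbered per session, not named per commit.
        log_path = log_dir / f"session_{session:05d}.log"
        with open(log_path, "w") as log:
            # Send stdout and stderr to the same file; a failed session
            # (non-zero exit) does not stop the loop.
            subprocess.run(command, stdout=log, stderr=subprocess.STDOUT, check=False)
        session += 1
    return session


# Hypothetical invocation (the prompt file name is an assumption):
#   prompt = Path("AGENT_PROMPT.md").read_text()
#   run_agent_loop(["claude", "--dangerously-skip-permissions", "-p", prompt],
#                  "agent_logs")
```

Passing the command as a list keeps the loop independent of the agent binary, so it can be exercised with any harmless stand-in command before pointing it at a real agent.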

Running Claude in Parallel

Running multiple instances in parallel addresses two weaknesses: a single session can only do one thing at a time, and multiple agents allow for specialization.

The implementation creates a bare git repo, and each agent gets a Docker container with the repo mounted. Each agent clones a local copy, works, then pushes changes upstream. A simple synchronization algorithm prevents conflicts:

Claude takes a "lock" on a task by writing a text file to current_tasks/. Git synchronization forces the second agent to pick a different task if two try to claim the same one.
Claude works, pulls from upstream, merges changes, pushes, and removes the lock. Merge conflicts are frequent but Claude handles them.
The infinite loop spawns a new session in a fresh container.
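The claim step can be illustrated with a small sketch. In the article the arbitration comes from git: the second agent to claim the same task is forced to pick another. The code below swaps in a different mechanism, exclusive file creation on a local filesystem, purely as a stand-in for that behavior. The function names and the `.txt` lock naming are hypothetical.

```python
import os
from pathlib import Path


def try_claim(task_dir, task_name, agent_id):
    """Return True if this agent created the lock file, False if it existed."""
    task_dir = Path(task_dir)
    task_dir.mkdir(parents=True, exist_ok=True)
    lock_path = task_dir / f"{task_name}.txt"
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so exactly one
        # claimant wins. This stands in for git refusing the second claim.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(f"claimed by {agent_id}\n")
    return True


def release(task_dir, task_name):
    """Remove the lock once the work has been merged and pushed."""
    (Path(task_dir) / f"{task_name}.txt").unlink(missing_ok=True)


def pick_task(task_dir, candidates, agent_id):
    """Claim the first candidate that nobody else holds, or return None."""
    for task in candidates:
        if try_claim(task_dir, task, agent_id):
            return task
    return None
```

For example, with two candidate tasks, the first two agents to call `pick_task` each receive a different task and a third receives `None` until one of the locks is released.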

There is no orchestration agent — each Claude agent decides how to act independently, typically picking up the "next most obvious" problem.

Lessons from Programming with Claude Agent Teams

Write Extremely High-Quality Tests

Claude works autonomously, so "it's important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem." The author built a continuous integration pipeline and tightened its enforcement as new failure modes were identified.
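One plausible form of such enforcement is a regression gate that rejects a change if any previously passing test now fails. The sketch below is an assumption about what a check of this kind could look like, not a description of the author's pipeline. The function names and the result format (a mapping from test name to pass/fail) are hypothetical.

```python
def find_regressions(previous, current):
    """Return sorted names of tests that passed before but do not pass now.

    Both arguments map test name -> True (pass) or False (fail). A test that
    is missing from `current` is treated as a regression.
    """
    return sorted(
        name
        for name, passed in previous.items()
        if passed and not current.get(name, False)
    )


def gate(previous, current):
    """Return (ok, report_lines); ok is False if any regression exists."""
    regressions = find_regressions(previous, current)
    report = [f"ERROR regression: {name}" for name in regressions]
    return (not regressions, report)
```

Tests that were already failing do not block a change under this rule. Only a transition from pass to fail (or a test disappearing) does.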

Put Yourself in Claude's Shoes

The test harness was designed for Claude, not humans. Instructions include maintaining extensive READMEs and progress files updated frequently. The author designed around inherent LLM limitations:

Context window pollution: The harness should not print thousands of useless bytes. Important information goes to log files. Logs should be easy to process — "if there are errors, Claude should write ERROR and put the reason on the same line so grep will find it." Aggregate summary statistics are pre-computed.
Time blindness: Claude can't tell time and "will happily spend hours running tests instead of making progress." The harness includes a --fast option running a 1% or 10% random sample, deterministic per-agent but random across VMs.
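Both conventions can be sketched briefly. The code below is illustrative only: the function names, the `PASS`/`SUMMARY` line formats, and the use of a hashed agent identifier as the random seed are assumptions. The source specifies only that failures are logged as `ERROR` with the reason on the same line, that summary statistics are pre-computed, and that the sample is deterministic per agent.

```python
import hashlib
import random


def select_tests(all_tests, agent_id, fraction):
    """Pick a sample that is stable for one agent and varies between agents."""
    # Hash the agent id into a seed. Python's built-in hash() of a str is
    # salted per process, so it would not be stable between runs.
    seed = int.from_bytes(hashlib.sha256(agent_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    ordered = sorted(all_tests)
    count = min(len(ordered), max(1, round(len(ordered) * fraction)))
    return sorted(rng.sample(ordered, count))


def format_result(test_name, passed, reason=""):
    """One line per test; failures start with ERROR and carry the reason."""
    if passed:
        return f"PASS {test_name}"
    # Collapse whitespace so the reason stays on the same line as ERROR.
    return f"ERROR {test_name}: {' '.join(reason.split())}"


def summarize(lines):
    """Pre-compute aggregate counts so the agent need not count log lines."""
    errors = sum(1 for line in lines if line.startswith("ERROR"))
    return f"SUMMARY total={len(lines)} passed={len(lines) - errors} failed={errors}"
```

Because the seed depends only on the agent identifier, an agent re-running the fast mode sees the same subset and can tell whether its change fixed a failure, while different agents are likely to cover different parts of the suite.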

Make Parallelism Easy

When there are many distinct failing tests, parallelization is trivial — each agent picks a different test. After reaching a 99% pass rate, each agent worked on compiling a different small open-source project (SQLite, Redis, libjpeg, QuickJS, Lua).

Compiling the Linux kernel was harder since it's one giant task: every agent would hit the same bug and overwrite each other's changes. The fix used GCC as an oracle. A new harness compiled a random majority of the kernel's files with GCC and only the remaining files with Claude's compiler, so each agent could fix different bugs in different files. Delta debugging was still needed for pairs of files that failed together but worked independently.
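The two ideas in this paragraph, a random split of files between a trusted compiler and the compiler under test, and delta debugging to isolate files that only fail in combination, can be sketched as follows. This is a simplified illustration, not the author's harness: the function names, the `"gcc"`/`"under_test"` labels, the default 10% fraction, and the per-agent seeding are assumptions, and `fails` is a caller-supplied predicate standing in for "build and boot with these files compiled by the new compiler".

```python
import hashlib
import math
import random


def assign_compilers(files, agent_id, fraction_under_test=0.1):
    """Map each file to "gcc" or "under_test"; the split depends on agent_id."""
    seed = int.from_bytes(hashlib.sha256(agent_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    ordered = sorted(files)
    count = min(len(ordered), max(1, round(len(ordered) * fraction_under_test)))
    chosen = set(rng.sample(ordered, count))
    return {f: ("under_test" if f in chosen else "gcc") for f in ordered}


def ddmin(items, fails):
    """Shrink `items` (unique names) to a smaller list for which fails() holds.

    Simplified delta debugging: try each chunk, then each complement, and
    split into finer chunks when neither reproduces the failure.
    """
    items = list(items)
    if not fails(items):
        raise ValueError("the full set must reproduce the failure")
    n = 2
    while len(items) >= 2:
        size = math.ceil(len(items) / n)
        chunks = [items[i:i + size] for i in range(0, len(items), size)]
        reduced = False
        for chunk in chunks:
            if fails(chunk):
                items, n, reduced = chunk, 2, True
                break
        if not reduced:
            for chunk in chunks:
                rest = [x for x in items if x not in chunk]
                if fails(rest):
                    items, n, reduced = rest, max(n - 1, 2), True
                    break
        if not reduced:
            if n >= len(items):
                break  # already at single-item granularity
            n = min(len(items), n * 2)
    return items
```

If a failure appears only when both `a.c` and `d.c` are built by the compiler under test, testing files one at a time finds nothing, whereas `ddmin` narrows a larger failing set down to that pair.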

Multiple Agent Roles

Parallelism enables specialization: one agent coalesced duplicate code, another improved compiler performance, a third focused on outputting efficient compiled code, another critiqued design from a Rust developer's perspective, and another handled documentation.

Stress Testing the Limits of Agent Teams

The project serves as a capability benchmark. Previous Opus 4 models were barely capable of producing a functional compiler. "Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites." Opus 4.6 was tested to push further.

Evaluation

Over nearly 2,000 sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, costing just under $20,000. This was a clean-room implementation with no internet access, depending only on the Rust standard library.

The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V. It also compiles QEMU, FFmpeg, SQLite, Postgres, and Redis, achieves a 99% pass rate on most compiler test suites, including the GCC torture test suite, and can compile and run Doom.

Limitations include:

Lacks the 16-bit x86 compiler needed to boot Linux out of real mode (calls out to GCC for this)
Does not yet have a dependable assembler and linker of its own: these are still somewhat buggy, and the demo used GCC's assembler and linker
Successfully builds many projects but not all — not yet a drop-in replacement for a real compiler
Generated code is less efficient than GCC's output, even when GCC runs with all optimizations disabled
Rust code quality is reasonable but nowhere near expert level

The compiler has nearly reached the limits of Opus's abilities. "New features and bugfixes frequently broke existing functionality." As a particularly challenging example, Opus couldn't implement the 16-bit x86 code generator needed to boot Linux out of real mode. Although the compiler can output correct 16-bit x86 via opcode prefixes, the result exceeds the 32k code limit that Linux enforces, so on x86 the build calls out to GCC for this step (ARM and RISC-V are compiled entirely by Claude's compiler).

Looking Forward

The author frames agent teams as showing "the possibility of implementing entire, complex projects autonomously," allowing users to become more ambitious with goals. However, fully autonomous development carries real risks — when tests pass, it's easy to assume the job is done, "when this is rarely the case." The author expresses both excitement and unease, noting that "the thought of programmers deploying software they've never personally verified is a real concern."

The rapid progress "opens the door to writing an enormous amount of new code," with positive applications expected to outweigh negative, but requiring "new strategies to navigate safely."

Acknowledgements

Thanks to Josef Bacik, Edwin Chen, Bernardo Meurer Costa, Jake Eaton, Dan Kelley, Felix Klock, Jannet Park, Steve Weis, and others across Anthropic.
