Scaling Managed Agents: Decoupling the brain from the hands

Published: Apr 08, 2026

Authors: Lance Martin, Gabe Cemaj, and Michael Cohen. Acknowledgements to Nodir Turakulov, Jeremy Fox, the Agents API team, and Jake Eaton.

Harnesses encode assumptions that go stale as models improve. Managed Agents—a hosted service for long-horizon agent work—is built around interfaces that stay stable as harnesses change.

A recurring theme in Anthropic's engineering blog is building effective agents and designing harnesses for long-running work. A key insight is that harnesses encode assumptions about Claude's limitations, but those assumptions need regular reassessment because they can become outdated as models improve.

As one example, Claude Sonnet 4.5 would wrap up tasks prematurely near its context limit, a behavior called "context anxiety." The team addressed this by adding context resets to the harness. However, when the same harness ran with Claude Opus 4.5, the behavior was gone, making the resets dead weight.

Since harnesses will keep evolving, Anthropic built Managed Agents: a hosted service in the Claude Platform that runs long-horizon agents through interfaces designed to outlast any particular implementation.

An Old Problem in Computing

Building Managed Agents required solving the challenge of designing a system for "programs as yet unthought of." Operating systems solved this decades ago by virtualizing hardware into abstractions like process and file, which were general enough for programs that did not yet exist. The read() call works regardless of whether it accesses a 1970s disk pack or a modern SSD. The abstractions stayed stable while implementations changed freely.

Managed Agents follows the same pattern by virtualizing agent components:

Session: the append-only log of everything that happened
Harness: the loop that calls Claude and routes tool calls to relevant infrastructure
Sandbox: an execution environment where Claude can run code and edit files

Each component can be swapped without disturbing the others. The system is opinionated about the shape of these interfaces, not about what runs behind them.
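The three components can be sketched as small interfaces. This is an illustrative sketch only, not the Managed Agents API: the Python names (Session, Sandbox, harness_step, and the two toy sandboxes) are assumptions introduced here to show how one harness step can run unchanged against interchangeable implementations.

```python
from typing import Any, Protocol

Event = dict[str, Any]


class Sandbox(Protocol):
    """The "hands": anything that can run a named tool and return text."""

    def execute(self, name: str, input: dict[str, Any]) -> str: ...


class Session(Protocol):
    """The append-only log of everything that happened."""

    def emit_event(self, session_id: str, event: Event) -> None: ...

    def get_events(self, session_id: str) -> list[Event]: ...


class InMemorySession:
    """Toy session store; a real one would be durable."""

    def __init__(self) -> None:
        self._logs: dict[str, list[Event]] = {}

    def emit_event(self, session_id: str, event: Event) -> None:
        self._logs.setdefault(session_id, []).append(event)

    def get_events(self, session_id: str) -> list[Event]:
        return list(self._logs.get(session_id, []))


class EchoSandbox:
    """Hypothetical sandbox that echoes the call back."""

    def execute(self, name: str, input: dict[str, Any]) -> str:
        return f"{name}: {input}"


class UpperSandbox:
    """A second hypothetical sandbox with different behavior."""

    def execute(self, name: str, input: dict[str, Any]) -> str:
        return f"{name}: {input}".upper()


def harness_step(session: Session, sandbox: Sandbox, session_id: str,
                 name: str, input: dict[str, Any]) -> str:
    """One pass of the loop: route a tool call, then record the result."""
    result = sandbox.execute(name, input)
    session.emit_event(session_id, {"type": "tool_result", "tool": name, "output": result})
    return result
```

Because harness_step depends only on the two interfaces, either sandbox can be passed in without touching the session or the loop.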

Don't Adopt a Pet

Initially, all agent components were placed in a single container, so the session, harness, and sandbox shared one environment. This had benefits: file edits were direct syscalls, and there were no service boundaries to design.

But coupling everything into one container created a classic "pets vs. cattle" infrastructure problem. In that analogy, a pet is a named, hand-tended individual that can't be lost, while cattle are interchangeable. The server had become a pet. If a container failed, the session was lost. Unresponsive containers had to be nursed back to health.

Debugging stuck sessions was particularly painful. The WebSocket event stream was the only debugging window, but it couldn't reveal where failures arose: bugs in the harness, packet drops, and offline containers all looked the same. Telling them apart required opening a shell inside the container, but because that container often held user data, engineers were in practice left without a way to debug.

A second issue: the harness assumed Claude's work lived in the same container. When customers wanted Claude connected to their VPC, they had to either peer their network or run the harness in their own environment. An assumption baked into the harness became a problem for connecting to different infrastructure.

Decouple the Brain from the Hands

The solution was to decouple the "brain" (Claude and its harness) from the "hands" (sandboxes and tools performing actions) and the "session" (the event log). Each became an interface making few assumptions about the others, and each could fail or be replaced independently.

The Harness Leaves the Container

Decoupling meant the harness no longer lived inside the container. It called the container like any other tool: execute(name, input) → string. The container became cattle. If the container died, the harness caught the failure as a tool-call error and passed it back to Claude. If Claude decided to retry, a new container could be reinitialized with provision({resources}). There was no longer a need to nurse failed containers.
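A minimal sketch of this failure path follows. The execute and provision names come from the text above. Everything else (ContainerDied, FlakyContainer, the result dictionary, and the wants_retry callback standing in for Claude's decision) is a hypothetical stand-in.

```python
from typing import Any, Callable


class ContainerDied(Exception):
    """Hypothetical error raised when a sandbox container is unreachable."""


class FlakyContainer:
    """Stand-in container that either works or is dead."""

    def __init__(self, healthy: bool) -> None:
        self.healthy = healthy

    def execute(self, name: str, input: dict[str, Any]) -> str:
        if not self.healthy:
            raise ContainerDied("container unreachable")
        return f"ran {name}"


def provision(resources: dict[str, Any]) -> FlakyContainer:
    """Stand-in for provision({resources}): returns a fresh, healthy container."""
    return FlakyContainer(healthy=True)


def call_tool(container: FlakyContainer, name: str, input: dict[str, Any]) -> dict[str, Any]:
    """Turn a dead container into an ordinary tool-call error for the model."""
    try:
        return {"is_error": False, "content": container.execute(name, input)}
    except ContainerDied as exc:
        return {"is_error": True, "content": str(exc)}


def run(container: FlakyContainer, name: str, input: dict[str, Any],
        resources: dict[str, Any],
        wants_retry: Callable[[dict[str, Any]], bool]) -> dict[str, Any]:
    """Call the tool; if it failed and the model opts to retry, use a new container."""
    result = call_tool(container, name, input)
    if result["is_error"] and wants_retry(result):
        container = provision(resources)  # replace the container, never repair it
        result = call_tool(container, name, input)
    return result
```

The dead container is never repaired in this sketch. It is simply replaced, which is what makes it cattle rather than a pet.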

Recovering from Harness Failure

The harness also became cattle. Since the session log sits outside the harness, nothing in the harness needs to survive a crash. When one fails, a new one boots with wake(sessionId), retrieves the event log via getSession(id), and resumes from the last event. During the agent loop, the harness writes to the session with emitEvent(id, event) to maintain a durable record.
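The recovery flow above can be sketched as follows, assuming an in-memory dictionary in place of the durable session store. The wake, get_session, and emit_event functions mirror the names in the text. Their Python signatures and the toy Harness class are assumptions made for illustration.

```python
from typing import Any

Event = dict[str, Any]

# Stand-in for the durable session store, which lives outside any harness.
SESSIONS: dict[str, list[Event]] = {}


def emit_event(session_id: str, event: Event) -> None:
    """Append one event to the session log."""
    SESSIONS.setdefault(session_id, []).append(event)


def get_session(session_id: str) -> list[Event]:
    """Return a copy of the full event log for a session."""
    return list(SESSIONS.get(session_id, []))


class Harness:
    """Stateless harness: its position is derived from the log, never held privately."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.next_step = len(get_session(session_id))  # resume after the last event

    def step(self) -> Event:
        """Do one unit of work and record it durably."""
        event = {"step": self.next_step}
        emit_event(self.session_id, event)
        self.next_step += 1
        return event


def wake(session_id: str) -> Harness:
    """Boot a fresh harness for an existing (or new) session."""
    return Harness(session_id)
```

If a harness is discarded after two steps, a new one created with wake for the same session starts at step 2, because the only state that matters is in the log.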

The Security Boundary

In the coupled design, untrusted code Claude generated ran in the same container as credentials—a prompt injection only needed to convince Claude to read its own environment. With those tokens, an attacker could spawn fresh, unrestricted sessions. Narrow scoping is an obvious mitigation, but this encodes assumptions about what Claude can't do with limited tokens—and Claude keeps getting smarter.

The structural fix ensures tokens are never reachable from the sandbox where Claude's generated code runs, using two patterns:

For Git: Each repository's access token is used to clone the repo during sandbox initialization and is wired into the local git remote. Git push and pull then work from inside the sandbox without the agent ever handling the token itself.
For custom tools: MCP is supported with OAuth tokens stored in a secure vault. Claude calls MCP tools via a dedicated proxy that takes in a session-associated token, fetches corresponding credentials from the vault, and makes the external call. The harness is never made aware of any credentials.
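The proxy pattern for custom tools can be sketched as below. This is a toy model under stated assumptions: Vault, McpProxy, fake_upstream, and the token strings are hypothetical, and a real proxy would make an authenticated network call where fake_upstream merely compares strings.

```python
from typing import Any, Callable

Upstream = Callable[[str, str, dict[str, Any], str], str]


class Vault:
    """Hypothetical secure vault keyed by (session token, server name)."""

    def __init__(self, secrets: dict[tuple[str, str], str]) -> None:
        self._secrets = dict(secrets)

    def fetch(self, session_token: str, server: str) -> str:
        return self._secrets[(session_token, server)]


class McpProxy:
    """Only the proxy ever holds a credential; callers pass a session token."""

    def __init__(self, vault: Vault, upstream: Upstream) -> None:
        self._vault = vault
        self._upstream = upstream

    def call(self, session_token: str, server: str, tool: str, args: dict[str, Any]) -> str:
        credential = self._vault.fetch(session_token, server)
        # The credential is used for the external call and never returned.
        return self._upstream(server, tool, args, credential)


def fake_upstream(server: str, tool: str, args: dict[str, Any], credential: str) -> str:
    """Stand-in for the external service; a real proxy would make a network call."""
    authorized = credential == "oauth-secret-123"
    return f"{server}.{tool} ok" if authorized else "401 unauthorized"
```

In this sketch the harness side only ever sees the session token and the returned string, so nothing it logs or forwards can contain the OAuth credential.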

The Session Is Not Claude's Context Window

Long-horizon tasks often exceed Claude's context window. Standard approaches involve irreversible decisions about what to keep—compaction (saving summaries), memory tools (writing context to files for cross-session learning), and context trimming (selectively removing old tool results or thinking blocks).

But irreversible retention/discard decisions can lead to failures, since it's difficult to know which tokens future turns will need. Prior research has explored storing context as an object that lives outside the context window—for example, as an object in a REPL that the LLM programmatically accesses by writing code to filter or slice it.

In Managed Agents, the session provides this benefit, serving as a context object outside Claude's context window. Context is durably stored in the session log. The getEvents() interface lets the brain interrogate context by selecting positional slices of the event stream—picking up from where it last stopped reading, rewinding before a specific moment, or rereading context before an action.
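The text names getEvents() but does not give its parameters, so the sketch below assumes a simple start and end position. It shows two of the access patterns described: picking up from where the brain last stopped reading, and rereading the events that led up to a given moment. SessionLog, read_new, and context_before are hypothetical names.

```python
from typing import Any, Optional

Event = dict[str, Any]


class SessionLog:
    """Toy append-only log with positional reads."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def emit_event(self, event: Event) -> int:
        """Append an event and return its position in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def get_events(self, start: int = 0, end: Optional[int] = None) -> list[Event]:
        """Return the events in positions [start, end); the log itself is unchanged."""
        return list(self._events[start:end])


def read_new(log: SessionLog, cursor: int) -> tuple[list[Event], int]:
    """Pick up from where the reader last stopped; returns events and the new cursor."""
    events = log.get_events(start=cursor)
    return events, cursor + len(events)


def context_before(log: SessionLog, position: int, window: int) -> list[Event]:
    """Reread up to `window` events that led up to `position`."""
    return log.get_events(start=max(0, position - window), end=position)
```

Nothing is discarded by either read, so a slice that turns out to be the wrong one can simply be requested again with different bounds.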

Fetched events can also be transformed in the harness before being passed to Claude's context window. These transformations can be whatever the harness encodes, including organizing context for a high prompt cache hit rate and other context engineering. The concerns of recoverable context storage (in the session) and arbitrary context management (in the harness) are kept separate because the team can't predict what context engineering future models will require. The interfaces push context management into the harness and guarantee only that the session is durable and available for interrogation.
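As one hypothetical example of such a transformation, the sketch below elides the content of older tool results when building the view passed to the model, while leaving the stored events untouched. The build_context function, the event fields, and the elision policy are all assumptions made for illustration, not a description of what any Anthropic harness does.

```python
from typing import Any

Event = dict[str, Any]

ELIDED = "[elided; still available in the session log]"


def build_context(events: list[Event], keep_recent: int) -> list[Event]:
    """Build the view sent to the model without mutating the stored events."""
    cutoff = len(events) - keep_recent  # events before this index count as old
    view: list[Event] = []
    for index, event in enumerate(events):
        item = dict(event)  # shallow copy so the stored event is never changed
        if index < cutoff and item.get("type") == "tool_result":
            item["content"] = ELIDED
        view.append(item)
    return view
```

Because the session log is unchanged, the decision is reversible: a later turn can fetch the original tool result again if it turns out to matter.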

Many Brains, Many Hands

Many Brains

Decoupling solved one of the earliest customer complaints. Teams wanting Claude to work against resources in their own VPC previously had to peer their network, because the container holding the harness assumed every resource sat next to it. Once the harness left the container, that assumption disappeared.

There was also a performance payoff. When the brain was in a container, many brains required many containers. Each session paid the full container setup cost up front—even sessions that would never touch the sandbox had to clone the repo, boot the process, and fetch pending events.

This dead time appears in time-to-first-token (TTFT)—the latency a user most acutely feels. Decoupling means containers are provisioned by the brain via tool call only when needed. Sessions that don't need a container right away don't wait for one. Inference starts as soon as the orchestration layer pulls pending events from the session log. With this architecture, p50 TTFT dropped roughly 60% and p95 dropped over 90%. Scaling to many brains means starting many stateless harnesses, connecting them to hands only if needed.
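Provisioning on first use can be sketched with a small wrapper. LazySandbox and Container are hypothetical names, and the provision callable stands in for whatever actually creates a container. The point is only that setup cost is paid when the first tool call arrives, not when the session starts.

```python
from typing import Any, Callable, Optional


class Container:
    """Stand-in for a provisioned sandbox container."""

    def execute(self, name: str, input: dict[str, Any]) -> str:
        return f"ran {name}"


class LazySandbox:
    """Defers container setup until the brain first asks for it."""

    def __init__(self, provision: Callable[[], Container]) -> None:
        self._provision = provision
        self._container: Optional[Container] = None

    @property
    def provisioned(self) -> bool:
        return self._container is not None

    def execute(self, name: str, input: dict[str, Any]) -> str:
        if self._container is None:
            # Setup cost is paid here, on first use, not at session start.
            self._container = self._provision()
        return self._container.execute(name, input)
```

A session that only ever produces text never calls execute in this sketch, so it never waits for a container at all.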

Many Hands

The team also wanted each brain to connect to many hands. In practice, Claude must reason about multiple execution environments and decide where to send work—a harder cognitive task than operating in a single shell. The initial single-container design was chosen because earlier models couldn't handle this. As intelligence scaled, the single container became the limitation: when it failed, state was lost for every hand the brain was reaching into.

Decoupling makes each hand a tool: execute(name, input) → string—name and input go in, string comes out. That interface supports any custom tool, any MCP server, and Anthropic's own tools. The harness doesn't know whether the sandbox is a container, a phone, or a Pokémon emulator. And because no hand is coupled to any brain, brains can pass hands to one another.
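A sketch of the many-hands idea follows, assuming each hand is just a function from an input dictionary to a string. The Hands registry, its hand_off method, and the example tool names are hypothetical. Only the execute(name, input) → string shape comes from the text.

```python
from typing import Any, Callable

Hand = Callable[[dict[str, Any]], str]


class Hands:
    """Registry of hands available to one brain, addressed by name."""

    def __init__(self) -> None:
        self._hands: dict[str, Hand] = {}

    def register(self, name: str, hand: Hand) -> None:
        self._hands[name] = hand

    def execute(self, name: str, input: dict[str, Any]) -> str:
        """Name and input go in, a string comes out, whatever the hand is."""
        if name not in self._hands:
            return f"error: unknown tool {name!r}"
        return self._hands[name](input)

    def hand_off(self, name: str, other: "Hands") -> None:
        """Pass a hand to another brain; it is no longer reachable from this one."""
        other.register(name, self._hands.pop(name))


def shell(input: dict[str, Any]) -> str:
    """Hypothetical hand backed by a container shell."""
    return f"$ {input['cmd']}"


def emulator(input: dict[str, Any]) -> str:
    """Hypothetical hand backed by a game emulator."""
    return f"pressed {input['button']}"
```

The registry never inspects what a hand is backed by, which is why a container, an MCP server, and an emulator can sit side by side behind the same call.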

Conclusion

The challenge was an old one: designing a system for "programs as yet unthought of." Operating systems have lasted decades by virtualizing hardware into general abstractions. With Managed Agents, Anthropic aimed to design a system accommodating future harnesses, sandboxes, or other components around Claude.

Managed Agents is a meta-harness: it is unopinionated about the specific harness Claude will need in the future, and instead provides general interfaces that allow many different harnesses. For example, Claude Code is an excellent harness used widely across tasks, and task-specific agent harnesses excel in narrow domains. Managed Agents can accommodate any of these and keep pace with Claude's intelligence over time.

Meta-harness design means being opinionated about interfaces around Claude: Claude will need the ability to manipulate state (the session) and perform computation (the sandbox), and will require the ability to scale to many brains and many hands. The interfaces are designed to run reliably and securely over long time horizons. But no assumptions are made about the number or location of brains or hands Claude will need.
