Effective harnesses for long-running agents
Published: Nov 26, 2025
Author: Justin Young
Agents still face challenges working across many context windows. Anthropic looked to human engineers for inspiration in creating a more effective harness for long-running agents.
As AI agents grow more capable, developers are increasingly assigning them complex tasks that span hours or even days. However, getting agents to make consistent progress across multiple context windows remains an open problem.
The core challenge is that agents must work in discrete sessions, with each new session beginning with no memory of what came before. The article uses the analogy of a software project staffed by engineers working in shifts, where each new engineer arrives with no knowledge of the previous shift. Since context windows are limited and most complex projects can't be completed within a single window, agents need a way to bridge the gap between coding sessions.
The Solution
Anthropic developed a two-fold solution for the Claude Agent SDK: an initializer agent that sets up the environment on the first run, and a coding agent tasked with making incremental progress in every session while leaving clear artifacts for the next session. Code examples are available in the accompanying quickstart.
The long-running agent problem
The Claude Agent SDK is described as a "powerful, general-purpose agent harness adept at coding" with context management capabilities such as compaction. Theoretically, this should allow an agent to do useful work for an arbitrarily long time.
However, compaction alone isn't sufficient. Even Opus 4.5 running on the Claude Agent SDK in a loop across multiple context windows falls short of building a production-quality web app from just a high-level prompt.
Claude's failures manifested in two patterns:
- Trying to do too much at once — attempting to one-shot the app, often running out of context mid-implementation, leaving the next session to guess at what happened.
- Premature completion — after some features were built, a later agent instance would look around, see progress had been made, and declare the job done.
The solution decomposes into two parts: setting up an initial environment that lays the foundation for all required features, and prompting each agent to make incremental progress while leaving the environment in a clean state — "no major bugs, the code is orderly and well-documented."
Environment management
Feature list
The initializer agent was prompted to write a comprehensive file of feature requirements expanding on the user's initial prompt. For the claude.ai clone example, this meant over 200 features, such as "a user can open a new chat, type in a query, press enter, and see an AI response." All features were initially marked as "failing" so later coding agents had a clear outline of what full functionality looked like.
Each feature entry is structured as JSON with fields for category, description, steps, and a passes boolean. An example feature entry describes a "New chat button creates a fresh conversation" with verification steps and passes: false.
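As a minimal sketch of what such an entry could look like, the snippet below builds one entry in Python and serializes it to JSON. The field names follow the description above; the category value, the wording of the steps, and the helper name make_feature are illustrative assumptions, not the article's exact file.

```python
import json


def make_feature(category, description, steps):
    """Build one feature entry; every feature starts out as failing."""
    return {
        "category": category,
        "description": description,
        "steps": list(steps),
        "passes": False,
    }


feature = make_feature(
    "functional",  # assumed category label
    "New chat button creates a fresh conversation",
    [
        # Assumed verification steps, written the way a human tester would act
        "Navigate to the main interface",
        "Click the 'New Chat' button",
        "Verify a new conversation is created",
    ],
)

print(json.dumps(feature, indent=2))
```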
Coding agents are prompted to edit this file only by changing the status of the passes field, with strongly-worded instructions against removing or editing tests. JSON was chosen over Markdown because "the model is less likely to inappropriately change or overwrite JSON files."
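The article enforces this rule through prompting. A harness could also check it mechanically, and the sketch below shows one hypothetical way to do so: it compares the feature list before and after a session and accepts the edit only if nothing but the passes field changed. The function name only_passes_changed and the assumption that the file holds a list of entries are not from the article.

```python
def only_passes_changed(old_features, new_features):
    """Return True if the two lists differ at most in their `passes` fields."""
    # Adding or removing entries counts as a forbidden edit.
    if len(old_features) != len(new_features):
        return False
    for old, new in zip(old_features, new_features):
        old_rest = {k: v for k, v in old.items() if k != "passes"}
        new_rest = {k: v for k, v in new.items() if k != "passes"}
        # Any change to category, description, steps, etc. is rejected.
        if old_rest != new_rest:
            return False
        # The status itself must remain a boolean.
        if not isinstance(new.get("passes"), bool):
            return False
    return True
```

A harness could run such a check after each session and revert the file if it returns False.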
Incremental progress
The coding agent was asked to work on only one feature at a time. This incremental approach was critical to addressing the tendency to do too much at once. The model was also asked to commit progress to git with descriptive commit messages and write summaries in a progress file, enabling it to use git to revert bad changes and recover working states.
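The end-of-session bookkeeping described here can be sketched as follows. This is an illustration under stated assumptions, not the harness's actual code: the progress file format, the commit message template, and the function names record_progress and commit_commands are all hypothetical. In the article the agent performs these steps itself through its tools.

```python
from pathlib import Path


def record_progress(progress_path, feature, summary):
    """Append a short note about the finished feature to the progress file."""
    note = f"- {feature}: {summary}\n"
    with Path(progress_path).open("a", encoding="utf-8") as fh:
        fh.write(note)
    return note


def commit_commands(feature):
    """Return the git commands a session could run to commit its work."""
    return [
        ["git", "add", "-A"],
        ["git", "commit", "-m", f"Implement feature: {feature}"],
    ]
```

Each command list could be passed to subprocess.run(cmd, check=True) inside the project repository. Committing after every feature gives the next session a known-good state to return to with git if a later change breaks the app.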
Testing
A major failure mode was Claude's tendency to mark features as complete without proper testing. Claude would make code changes and test with unit tests or curl commands but fail to recognize that features didn't work end-to-end.
For web app building, Claude performed well at verifying features end-to-end once explicitly prompted to use browser automation tools and test as a human user would. The article shows screenshots taken by Claude through the Puppeteer MCP server as it tested the claude.ai clone.
Providing testing tools dramatically improved performance. Some limitations remain, such as Claude's vision limitations and browser automation tool constraints — for example, Claude can't see browser-native alert modals through the Puppeteer MCP.
Getting up to speed
Every coding agent is prompted to run through a series of steps:
- Run pwd to see the working directory
- Read git logs and progress files to understand recent work
- Read the features list file and choose the highest-priority incomplete feature
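The last step, choosing what to work on, can be sketched as a simple selection over the feature list. The sketch assumes the list is already ordered by priority, so the first entry that does not pass is the one to pick. The source does not say how priority is encoded, and the function name next_feature is hypothetical.

```python
def next_feature(features):
    """Return the first feature whose `passes` flag is false, or None."""
    for feature in features:
        if not feature["passes"]:
            return feature
    # Every feature passes: there is nothing left to implement.
    return None
```

Returning a single entry, rather than all incomplete ones, mirrors the one-feature-at-a-time rule.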
The initializer agent writes an init.sh script to run the development server, and the coding agent runs through a basic end-to-end test before implementing a new feature. This lets Claude quickly detect whether the app was left in a broken state, before new changes make the problem worse.
A typical session starts with the agent getting its bearings, reading the progress file and feature list, checking git logs, starting the development server, verifying fundamental features still work, and then beginning work on a new feature.
Agent failure modes and solutions
| Problem | Initializer Agent Behavior | Coding Agent Behavior |
|---|---|---|
| Claude declares victory too early | Set up a structured JSON feature list file | Read feature list at session start; choose a single feature |
| Claude leaves environment buggy or undocumented | Create initial git repo and progress notes file | Read progress notes and git logs at start; run basic test; commit and update at end |
| Claude marks features as done prematurely | Set up a feature list file | Self-verify all features; only mark "passing" after careful testing |
| Claude spends time figuring out how to run the app | Write an init.sh script | Read init.sh at session start |
Future work
The research demonstrates one possible set of solutions for long-running agent harnesses. Open questions remain:
- Whether a single general-purpose coding agent performs best across contexts, or whether a multi-agent architecture (testing agent, QA agent, code cleanup agent) could achieve better performance
- How to generalize these findings beyond full-stack web app development to fields like scientific research or financial modeling
Acknowledgements
Written by Justin Young, with special thanks to David Hershey, Prithvi Rajasakeran, Jeremy Hadfield, Naia Bouscal, Michael Tingley, Jesse Mu, Jake Eaton, Marius Buleandara, Maggie Vo, Pedram Navid, Nadine Yasser, and Alex Notov. The work reflects collective efforts across Anthropic, especially the code RL & Claude Code teams.
Footnotes
- The agents are referred to as separate only because they have different initial user prompts. The system prompt, tools, and overall agent harness were otherwise identical.