Harness Design for Long-Running Application Development

Published: Mar 24, 2026

Author: Prithvi Rajasekaran, member of Anthropic's Labs team

Harness design is key to performance at the frontier of agentic coding. This article describes how Anthropic pushed Claude further in frontend design and long-running autonomous software engineering.

Background

Over several months, the author worked on two interconnected problems: producing high-quality frontend designs and building complete applications without human intervention. This work originated with earlier efforts on a frontend design skill and a long-running coding agent harness, where prompt engineering and harness design improved Claude's performance above baseline — though both eventually hit ceilings.

To break through, the author sought novel AI engineering approaches that held across two different domains: one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), the author designed a multi-agent structure with a generator agent and an evaluator agent. Building a reliable evaluator meant developing criteria that could turn subjective judgments into concrete, gradable terms.

These techniques were then applied to long-running autonomous coding, carrying over two lessons: decomposing the build into tractable chunks and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture — planner, generator, and evaluator — that produced rich full-stack applications over multi-hour autonomous coding sessions.

Why Naive Implementations Fall Short

Previous work showed that harness design substantially impacts the effectiveness of long-running agentic coding. In an earlier experiment, an initializer agent decomposed a product spec into a task list, and a coding agent implemented features one at a time before handing off artifacts across sessions. The broader developer community converged on similar insights, with approaches like the "Ralph Wiggum" method using hooks or scripts for continuous iteration cycles.

But persistent problems remained for complex tasks. Two common failure modes were observed:

Loss of coherence: Models tend to lose coherence on lengthy tasks as the context window fills. Some models also exhibit "context anxiety," wrapping up work prematurely as they approach what they believe is their context limit. Context resets — clearing the context window entirely and starting a fresh agent with structured handoff — address both issues. This differs from compaction, where earlier conversation is summarized in place. While compaction preserves continuity, it doesn't provide a clean slate. In earlier testing, Claude Sonnet 4.5 exhibited context anxiety strongly enough that compaction alone wasn't sufficient, making context resets essential. This solved the core issue but added orchestration complexity, token overhead, and latency.

Poor self-evaluation: When asked to evaluate their own work, agents tend to confidently praise it even when quality is obviously mediocre. This is particularly pronounced for subjective tasks like design, where there is no binary check equivalent to a verifiable software test. Separating the agent doing the work from the agent judging it proves to be a strong lever. The separation doesn't immediately eliminate leniency, but tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work. Once external feedback exists, the generator has something concrete to iterate against.
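
The context reset described under loss of coherence can be sketched as follows. This is a minimal illustration, not the harness's actual code: the `Handoff` fields, the `handoff.json` file name, and both helper functions are hypothetical, since the article does not specify the format of the handoff artifact.

```python
import json
from dataclasses import dataclass, asdict, field
from pathlib import Path


@dataclass
class Handoff:
    """Hypothetical structured artifact passed between sessions."""
    spec_summary: str
    completed: list[str] = field(default_factory=list)
    remaining: list[str] = field(default_factory=list)
    notes: str = ""


def write_handoff(path: Path, handoff: Handoff) -> None:
    # The outgoing session persists its state before its context is discarded.
    path.write_text(json.dumps(asdict(handoff), indent=2))


def start_fresh_session(path: Path) -> str:
    # The new session sees only the artifact, never the earlier conversation.
    handoff = Handoff(**json.loads(path.read_text()))
    next_task = handoff.remaining[0] if handoff.remaining else "nothing left"
    return (
        f"Spec: {handoff.spec_summary}\n"
        f"Done so far: {', '.join(handoff.completed) or 'none'}\n"
        f"Next task: {next_task}\n"
        f"Notes: {handoff.notes}"
    )
```

The string returned by `start_fresh_session` would seed the first prompt of the new agent. Compaction, by contrast, would keep a summarized version of the old conversation in the same session.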

Frontend Design: Making Subjective Quality Gradable

The author started by experimenting on frontend design, where the self-evaluation issue was most visible. Without intervention, Claude gravitates toward safe, predictable layouts that are technically functional but visually unremarkable.

Two insights shaped the harness:

While aesthetics can't be fully reduced to a score, they can be improved with grading criteria that encode design principles and preferences. Instead of asking "is this design beautiful?", asking "does this follow our principles for good design?" gives something concrete to grade against.
By separating frontend generation from frontend grading, a feedback loop can drive the generator toward stronger outputs.

The Four Grading Criteria

Both the generator and evaluator agents received four grading criteria in their prompts:

Design quality: Does the design feel like a coherent whole rather than a collection of parts? Colors, typography, layout, imagery, and other details should combine to create a distinct mood and identity.
Originality: Is there evidence of custom decisions, or are these template layouts, library defaults, and AI-generated patterns? A human designer should recognize deliberate creative choices. Unmodified stock components or telltale AI patterns like purple gradients over white cards fail here.
Craft: Technical execution — typography hierarchy, spacing consistency, color harmony, contrast ratios. A competence check rather than a creativity check. Most reasonable implementations do fine; failing means broken fundamentals.
Functionality: Usability independent of aesthetics. Can users understand the interface, find primary actions, and complete tasks without guessing?

Design quality and originality were weighted more heavily than craft and functionality, since Claude already scored well on the latter by default. The criteria explicitly penalized highly generic "AI slop" patterns, pushing the model toward more aesthetic risk-taking.
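
One way to express this weighting is a simple weighted sum. The sketch below is illustrative only: the specific weights and the 0-10 scale are assumptions, because the article says only that design quality and originality counted for more than craft and functionality.

```python
# Hypothetical weights: the article states only that design quality and
# originality counted for more than craft and functionality.
WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-10 scale) into one number."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Under these assumed weights, a polished but generic page (4, 3, 9, 9) scores lower than a distinctive page with slightly weaker craft (8, 8, 7, 7), which is the direction of pressure the criteria were meant to create.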

The evaluator was calibrated using few-shot examples with detailed score breakdowns, ensuring judgment alignment and reducing score drift across iterations.

The Feedback Loop

The loop was built on the Claude Agent SDK. A generator agent created an HTML/CSS/JS frontend based on a user prompt. The evaluator was given the Playwright MCP, letting it interact with the live page directly before scoring each criterion and writing a detailed critique. In practice, the evaluator would navigate the page on its own, screenshotting and carefully studying the implementation before producing its assessment.

Five to fifteen iterations were run per generation, with each iteration typically pushing the generator in a more distinctive direction. Because the evaluator was actively navigating the page rather than scoring a static screenshot, each cycle took real wall-clock time. Full runs stretched up to four hours. The generator was instructed to make a strategic decision after each evaluation: refine the current direction if scores were trending well, or pivot to an entirely different aesthetic if the approach wasn't working.
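
The control flow of such a loop can be sketched as below. The `generate` and `evaluate` callables are stand-ins for agent calls (the real harness used the Claude Agent SDK and the Playwright MCP), and the refine-or-pivot rule shown here is an assumed simplification: in the article the generator made that strategic decision itself.

```python
from typing import Callable

# generate(prompt, feedback, strategy) -> artifact
# evaluate(artifact) -> (score, critique)
# Both are hypothetical stand-ins for agent calls.
Generate = Callable[[str, str, str], str]
Evaluate = Callable[[str], tuple[float, str]]


def run_loop(prompt: str, generate: Generate, evaluate: Evaluate,
             iterations: int = 5) -> list[tuple[str, float]]:
    history: list[tuple[str, float]] = []
    feedback, strategy = "", "refine"
    for _ in range(iterations):
        artifact = generate(prompt, feedback, strategy)
        score, feedback = evaluate(artifact)
        # Assumed rule: pivot when the score drops, otherwise keep refining.
        if history and score < history[-1][1]:
            strategy = "pivot"
        else:
            strategy = "refine"
        history.append((artifact, score))
    return history


def best_of(history: list[tuple[str, float]]) -> tuple[str, float]:
    # Keep every round so an earlier iteration can still be selected.
    return max(history, key=lambda item: item[1])
```

Keeping the full history rather than only the latest artifact makes it cheap to compare rounds after the run finishes.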

Results

Evaluator assessments improved over iterations before plateauing, with headroom still remaining. Some generations refined incrementally; others took sharp aesthetic turns between iterations.

The wording of the criteria steered the generator in unexpected ways. Including phrases like "the best designs are museum quality" pushed designs to converge on a particular visual style, suggesting that the prompting language directly shaped the character of the output.

While scores generally improved, the pattern wasn't always cleanly linear. Later implementations tended to be better as a whole, but the author regularly saw cases where a middle iteration was preferred over the last one. Implementation complexity tended to increase across rounds. Even on the first iteration, outputs were noticeably better than baseline with no prompting, suggesting the criteria themselves steered the model away from generic defaults before any evaluator feedback.

In one notable example with a Dutch art museum website prompt, by the ninth iteration the model had produced a clean, dark-themed landing page. On the tenth cycle, it scrapped the approach entirely and reimagined the site as a spatial experience: a 3D room with a checkered floor rendered in CSS perspective, artwork hung on walls in free-form positions, and doorway-based navigation between gallery rooms instead of scroll or click.

Scaling to Full-Stack Coding

The GAN-inspired pattern was applied to full-stack development. The generator-evaluator loop maps naturally onto the software development lifecycle, where code review and QA serve the same structural role as the design evaluator.

The Architecture

Building on the foundation from the original harness, a three-agent system was created:

Planner: The previous harness required the user to provide a detailed spec upfront. A planner agent was created that took a simple 1–4 sentence prompt and expanded it into a full product spec. It was prompted to be ambitious about scope and to stay focused on product context and high-level technical design rather than detailed technical implementation. The concern was that if the planner tried to specify granular technical details and got something wrong, errors would cascade into downstream implementation. The planner was also asked to find opportunities to weave AI features into product specs.

Generator: The one-feature-at-a-time approach from the earlier harness was applied here, instructing the generator to work in sprints, picking up one feature at a time from the spec. Each sprint implemented the app with a React, Vite, FastAPI, and SQLite (later PostgreSQL) stack. The generator was instructed to self-evaluate its work at the end of each sprint before handing off to QA. It also had git for version control.

Evaluator: Applications from earlier harnesses often looked impressive but still had real bugs. The evaluator used the Playwright MCP to click through the running application the way a user would, testing UI features, API endpoints, and database states. It graded each sprint against bugs found and criteria modeled on the frontend experiment, adapted to cover product depth, functionality, visual design, and code quality. Each criterion had a hard threshold — if any one fell below it, the sprint failed and the generator got detailed feedback.
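
The hard-threshold rule for the evaluator can be sketched as a per-criterion gate. The criterion names come from the article; the threshold values, the 0-10 scale, and the `SprintResult` type are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds on an assumed 0-10 scale; the article names the
# criteria but not the cut-off values.
THRESHOLDS = {
    "product_depth": 6.0,
    "functionality": 7.0,
    "visual_design": 6.0,
    "code_quality": 6.0,
}


@dataclass
class SprintResult:
    passed: bool
    failures: dict[str, float]  # criterion -> score that fell short


def grade_sprint(scores: dict[str, float]) -> SprintResult:
    # A missing score counts as a failure rather than a silent pass.
    failures = {
        name: scores.get(name, 0.0)
        for name, minimum in THRESHOLDS.items()
        if scores.get(name, 0.0) < minimum
    }
    return SprintResult(passed=not failures, failures=failures)
```

Unlike an averaged score, this gate fails the sprint when a single criterion falls short, even if every other score is high.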

Before each sprint, the generator and evaluator negotiated a sprint contract: agreeing on what "done" looked like before any code was written. This bridged the gap between user stories and testable implementation since the product spec was intentionally high-level. The generator proposed what it would build and how success would be verified, and the evaluator reviewed the proposal. The two iterated until they agreed.

Communication was handled via files: one agent would write a file, and another would read it and respond. Building against an agreed contract kept work faithful to the spec without over-specifying implementation too early.
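
A file-based contract negotiation might look like the sketch below. The file names, the JSON layout, the `approved` flag, and the `propose` and `review` callables are all hypothetical; the article states only that agents exchanged files and iterated until they agreed.

```python
import json
from pathlib import Path
from typing import Callable

# propose(previous_review) -> contract dict
# review(contract) -> review dict carrying an "approved" flag
# Both stand in for agent calls and are hypothetical.
Propose = Callable[[dict], dict]
Review = Callable[[dict], dict]


def negotiate_contract(workdir: Path, propose: Propose, review: Review,
                       max_rounds: int = 5) -> dict:
    contract_file = workdir / "sprint_contract.json"  # assumed file name
    review_file = workdir / "contract_review.json"    # assumed file name
    last_review: dict = {}
    for _ in range(max_rounds):
        # Generator side: write the proposal to disk.
        contract_file.write_text(json.dumps(propose(last_review), indent=2))
        # Evaluator side: read the proposal from disk and write a response.
        contract = json.loads(contract_file.read_text())
        review_file.write_text(json.dumps(review(contract), indent=2))
        last_review = json.loads(review_file.read_text())
        if last_review.get("approved"):
            return contract
    raise RuntimeError("no agreement on the sprint contract")
```

Routing every exchange through the filesystem leaves a readable record of what each side proposed and accepted.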

The model used was Claude Opus 4.5, and context resets were dropped from this harness since Opus 4.5 largely removed the context anxiety behavior on its own. Agents ran as one continuous session across the whole build, with the Claude Agent SDK's automatic compaction handling context growth.

Running the Harness

The following prompt was used:

"Create a 2D retro game maker with features including a level editor, sprite editor, entity behaviors, and a playable test mode."

Results comparison:

Harness	Duration	Cost
Solo	20 min	$9
Full harness	6 hr	$200

The harness was over 20x more expensive (about 22x the cost and 18x the duration), but the difference in output quality was immediately apparent.

Solo run issues: The layout wasted space, with fixed-height panels leaving most of the viewport empty. The workflow was rigid, with no guidance toward the correct sequence. Most critically, the game itself did not work: entities appeared on screen but nothing responded to input, because the wiring between entity definitions and the game runtime was broken.

Harness run results: The planner expanded the one-sentence prompt into a 16-feature spec spread across ten sprints. Beyond the core editors and play mode, the spec called for sprite animation, behavior templates, sound effects and music, AI-assisted sprite generation and level design, and game export with shareable links. The planner was given access to the frontend design skill and used it to create a visual design language for the app.

The app immediately showed more polish. The canvas used the full viewport, panels were sized sensibly, and the interface had a consistent visual identity. The sprite editor was richer with cleaner tool palettes, a better color picker, and more usable zoom controls. Because the planner was asked to weave AI features into specs, the app included a built-in Claude integration for generating game parts through prompting.

The biggest difference was in play mode — the user could actually move the entity and play the game. The physics had some rough edges (character overlapping with platforms), but the core worked, which the solo run did not achieve.

Evaluator Performance

Reading through the logs, the evaluator kept the implementation in line with the spec. Each sprint, it walked through the sprint contract's test criteria and exercised the running application through Playwright, filing bugs against anything that diverged from expected behavior. The contracts were granular — Sprint 3 alone had 27 criteria covering the level editor.

Examples of evaluator findings included: the rectangle fill tool placing tiles only at the drag start and end points instead of filling the region; a delete key handler that required both selection and selectedEntityId to be set, while clicking an entity set only one of them; and a FastAPI route ordering issue where "reorder" was matched as an integer frame_id.
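
The route ordering finding reflects how FastAPI matches path operations in declaration order, so a parameterized path declared first can capture a fixed path declared later. The sketch below models that first-match behavior with the standard library only; it is a simplified stand-in rather than FastAPI itself, and the `/frames/...` paths are hypothetical since the article gives only the names "reorder" and frame_id.

```python
import re


def compile_route(template: str) -> re.Pattern:
    # Turn "/frames/{frame_id}" into a regex with one named group per parameter.
    return re.compile("^" + re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template) + "$")


def dispatch(routes: list[str], path: str) -> tuple[str, dict[str, str]]:
    # The first declared route that matches wins.
    for template in routes:
        match = compile_route(template).match(path)
        if match:
            return template, match.groupdict()
    raise LookupError(path)


buggy = ["/frames/{frame_id}", "/frames/reorder"]  # parameterized route first
fixed = ["/frames/reorder", "/frames/{frame_id}"]  # fixed route first

print(dispatch(buggy, "/frames/reorder"))
print(dispatch(fixed, "/frames/reorder"))
```

With the buggy ordering, the string "reorder" is captured as `frame_id`; in a real FastAPI app with an integer-typed parameter that request would then fail validation. Declaring the fixed path first avoids the capture.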

Getting the evaluator to perform at this level took work. Out of the box, Claude was a poor QA agent. In early runs, it would identify legitimate issues then talk itself into deciding they weren't a big deal and approve the work anyway. It also tended to test superficially. The tuning loop involved reading evaluator logs, finding examples where judgment diverged from the author's, and updating the QA prompt to solve those issues. Several rounds were needed before the evaluator graded reasonably.

Iterating on the Harness

The first set of results was encouraging, but the harness was bulky, slow, and expensive. The logical next step was simplifying without degrading performance. Every component in a harness encodes an assumption about what the model can't do on its own, and those assumptions are worth stress testing.

The author's first attempt to radically simplify couldn't replicate original performance, and it became difficult to tell which pieces were load-bearing. A more methodical approach followed: removing one component at a time and reviewing impact.
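
The one-at-a-time removal procedure resembles a standard ablation study and can be sketched as below. The `run_harness` callable and its numeric quality score are hypothetical; the article describes reviewing the impact of each removal, not computing a single metric.

```python
from typing import Callable


def ablate_one_at_a_time(
    components: list[str],
    run_harness: Callable[[frozenset[str]], float],
) -> dict[str, float]:
    """Return the quality drop caused by removing each component on its own."""
    full = frozenset(components)
    baseline = run_harness(full)
    # Each run removes exactly one component so its effect is not confounded
    # with the removal of any other.
    return {name: baseline - run_harness(full - {name}) for name in components}
```

A drop near zero suggests the component is no longer load-bearing for the current model, while a large drop marks it as worth keeping.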

Opus 4.6 provided further motivation to reduce complexity. From the launch blog: the model "plans more carefully, sustains agentic tasks for longer, can operate more reliably in larger codebases, and has better code review and debugging skills."

Removing the Sprint Construct

The sprint construct was removed entirely. Given Opus 4.6's improvements, the model could likely handle work without this decomposition. Both the planner and evaluator were kept, as each continued adding obvious value. Without the planner, the generator under-scoped, creating less feature-rich applications.

With sprints removed, the evaluator moved to a single pass at the end of the run. On Opus 4.5, builds were at the edge of what the generator could do well solo, and the evaluator caught meaningful issues. On Opus 4.6, the boundary moved outward. Tasks that needed the evaluator's check before were now often within what the generator handled on its own. But for parts still at the edge of capabilities, the evaluator continued to give real lift.

The practical implication: the evaluator is worth the cost when the task sits beyond what the current model does reliably solo.

Prompting was also added to improve how the harness built AI features into each app, specifically getting the generator to build a proper agent that could drive app functionality through tools.

Results from the Updated Harness

The following prompt was used to generate a Digital Audio Workstation:

"Build a fully featured DAW in the browser using the Web Audio API."

The run took about 4 hours and cost about $124 in tokens.

Breakdown:

Agent & Phase	Duration	Cost
Planner	4.7 min	$0.46
Build (Round 1)	2 hr 7 min	$71.08
QA (Round 1)	8.8 min	$3.24
Build (Round 2)	1 hr 2 min	$36.89
QA (Round 2)	6.8 min	$3.09
Build (Round 3)	10.9 min	$5.88
QA (Round 3)	9.6 min	$4.06
Total V2 Harness	3 hr 50 min	$124.70
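
The totals in the breakdown can be checked with a few lines of arithmetic. The figures are copied from the table, with durations converted to minutes; the QA share is a derived number that does not appear in the article.

```python
# (phase, minutes, cost in USD) copied from the breakdown table.
PHASES = [
    ("Planner", 4.7, 0.46),
    ("Build (Round 1)", 127.0, 71.08),
    ("QA (Round 1)", 8.8, 3.24),
    ("Build (Round 2)", 62.0, 36.89),
    ("QA (Round 2)", 6.8, 3.09),
    ("Build (Round 3)", 10.9, 5.88),
    ("QA (Round 3)", 9.6, 4.06),
]

total_minutes = sum(minutes for _, minutes, _ in PHASES)
total_cost = sum(cost for _, _, cost in PHASES)
qa_cost = sum(cost for name, _, cost in PHASES if name.startswith("QA"))

hours, minutes = divmod(round(total_minutes), 60)
print(f"{hours} hr {minutes} min, ${total_cost:.2f}")      # 3 hr 50 min, $124.70
print(f"QA share of cost: {qa_cost / total_cost:.1%}")     # QA share of cost: 8.3%
```

The three QA passes together account for under a tenth of the spend; the build rounds dominate both time and cost.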

The generator ran coherently for over two hours without the sprint decomposition that Opus 4.5 had needed.

The QA agent still caught real gaps. In its first-round feedback, it noted the app looked impressive but several core DAW features were "display-only without interactive depth" — clips couldn't be dragged, there were no instrument UI panels, and no visual effect editors. In its second round, it caught remaining gaps including stub-only audio recording, missing clip resize/split, and numeric-only effect visualizations.

The final app had all core pieces of a functional music production program: a working arrangement view, mixer, and transport running in the browser. The user was able to put together a short song snippet entirely through prompting — the agent set tempo and key, laid down a melody, built a drum track, adjusted mixer levels, and added reverb. The core primitives for song composition were present, and the agent could drive them autonomously.

What Comes Next

As models continue to improve, they will be capable of working longer on more complex tasks. In some cases, the scaffold matters less over time, and developers can wait for the next model. On the other hand, better models create more space for harnesses that achieve complex tasks beyond baseline capability.

Key lessons from this work:

Always experiment with the model you're building against, read its traces on realistic problems, and tune its performance
For complex tasks, there is sometimes headroom from decomposing the task and applying specialized agents to each aspect
When a new model lands, re-examine the harness — strip away pieces no longer load-bearing and add new pieces for greater capability

The author's conviction is that "the space of interesting harness combinations doesn't shrink as models improve. Instead, it moves."

Appendix

The article includes an example plan generated by the planner agent for "RetroForge - 2D Retro Game Maker," a web-based creative studio for designing 2D retro-style video games. The plan describes four integrated creative modules (tile-based Level Editor, pixel-art Sprite Editor, visual Entity Behavior system, and instant Playable Test Mode), with AI assistance powered by Claude woven throughout. The example shows the Project Dashboard & Management feature with detailed user stories covering creating, viewing, opening, deleting, and duplicating projects, along with a project data model specifying metadata, canvas settings, tile size configuration, color palette selection, and associated assets.
