How We Built Our Multi-Agent Research System
Published: Jun 13, 2025
Authors: Jeremy Hadfield, Barry Zhang, Kenneth Lien, Florian Scholz, Jeremy Fox, and Daniel Ford
Overview
Claude's Research feature enables searching across the web, Google Workspace, and integrations to handle complex tasks. The article details engineering challenges and lessons from building this multi-agent system. The system uses multiple Claude agents working together — an orchestrator that plans research based on user queries, then spawns parallel agents for simultaneous information gathering.
Benefits of a Multi-Agent System
Research involves open-ended problems where required steps can't be predicted in advance. The process is inherently dynamic and path-dependent, mirroring how humans continuously update their approach based on discoveries.
AI agents suit research because the work demands flexibility to pivot as investigations unfold, with autonomous decision-making across many turns. A linear pipeline can't handle this.
The authors frame the essence of search as compression — distilling insights from vast corpora. Subagents enable this by running parallel context windows, exploring different aspects simultaneously, then condensing findings for the lead agent. Each subagent also provides separation of concerns with distinct tools, prompts, and exploration trajectories.
They draw an analogy to collective human intelligence: while individual human intelligence hasn't changed much in 100,000 years, societies became "exponentially more capable in the information age because of our collective intelligence and ability to coordinate."
Internal evaluations showed that the multi-agent system, with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents, "outperformed single-agent Claude Opus 4 by 90.2%" on the team's internal research eval. For example, when asked to identify all board members of the companies in the Information Technology S&P 500, the multi-agent system succeeded by decomposing the task across subagents, while the single agent failed with slow, sequential searches.
Three factors explained 95% of performance variance on the BrowseComp evaluation. Token usage alone accounted for 80% of variance, with tool call count and model choice as the other factors. Upgrading to Claude Sonnet 4 provided a larger performance gain than doubling the token budget on Claude Sonnet 3.7.
Tradeoffs: Multi-agent architectures consume tokens heavily — about 15× more than standard chats. These systems suit tasks where value justifies cost, heavy parallelization is possible, information exceeds single context windows, and numerous complex tools are involved. Most coding tasks, for example, involve fewer parallelizable subtasks and aren't as good a fit.
Architecture Overview for Research
The system uses an orchestrator-worker pattern where a lead agent coordinates while delegating to specialized parallel subagents.
Workflow:
- User submits a query
- Lead agent analyzes it, develops a strategy, spawns subagents
- Subagents act as intelligent filters, iteratively using search tools
- Subagents return findings to the lead agent
- Lead agent synthesizes results, decides if more research is needed
- If sufficient, findings pass to a CitationAgent for source attribution
- Final results with citations are returned to the user
The lead agent saves its plan to Memory so that the plan survives truncation: if the context window exceeds 200,000 tokens, it is truncated and the plan would otherwise be lost. Subagents independently perform web searches, evaluate results using interleaved thinking, and return findings. This contrasts with traditional RAG approaches that use static retrieval; this system dynamically adapts to new findings.
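The workflow above can be sketched in Python as a minimal orchestrator-worker loop. This is an illustration under stated assumptions, not the authors' implementation: `Memory`, `plan_subtasks`, `run_subagent`, and `add_citations` are hypothetical stand-ins for model calls, and the "decide if more research is needed" step is reduced to a single round.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hypothetical external store that outlives the lead agent's context."""
    plan: list[str] = field(default_factory=list)


def plan_subtasks(query: str) -> list[str]:
    # Stand-in for the lead agent's planning step (a model call in practice).
    return [f"{query}: background", f"{query}: recent developments"]


def run_subagent(subtask: str) -> str:
    # Stand-in for a subagent's iterative search loop; returns condensed findings.
    return f"findings for [{subtask}]"


def add_citations(report: str, sources: list[str]) -> str:
    # Stand-in for the CitationAgent step.
    return f"{report}\nSources: {'; '.join(sources)}"


def research(query: str, memory: Memory) -> str:
    # Persist the plan before any long-running work starts.
    memory.plan = plan_subtasks(query)
    # Run one subagent per subtask in parallel; map() preserves input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(run_subagent, memory.plan))
    # The lead agent synthesizes; here that is just concatenation.
    report = "\n".join(findings)
    return add_citations(report, sources=memory.plan)
```

Calling `research("solid-state batteries", Memory())` returns the two stubbed findings followed by a `Sources:` line. A fuller version would loop back to planning when the synthesized findings are judged insufficient.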
Prompt Engineering and Lessons Learned
Early agents exhibited problematic behaviors: spawning 50 subagents for simple queries, endlessly searching for nonexistent sources, and distracting each other with excessive updates. Prompt engineering was the primary lever for improvement.
The 8 Prompting Principles:
1. Think like your agents. The team built simulations using Console with exact system prompts and tools, watching agents work step-by-step. This revealed failure modes like agents continuing past sufficient results or using overly verbose queries.
2. Teach the orchestrator how to delegate. Each subagent needs an objective, output format, tool/source guidance, and clear task boundaries. Early on, the lead agent was allowed to give short instructions like "research the semiconductor shortage," which led subagents to misinterpret the task and duplicate each other's work.
3. Scale effort to query complexity. The prompts embed scaling rules: simple fact-finding needs ~1 agent with 3-10 tool calls; direct comparisons need 2-4 subagents with 10-15 calls each; complex research uses 10+ subagents with clearly divided responsibilities. This prevents overinvestment in simple queries.
4. Tool design and selection are critical. Agent-tool interfaces matter as much as human-computer interfaces. With MCP servers providing external tools, agents encounter descriptions of varying quality. Agents were given explicit heuristics: examine all tools first, match tool usage to intent, prefer specialized tools over generic ones.
5. Let agents improve themselves. Claude 4 models can serve as excellent prompt engineers, diagnosing failures and suggesting improvements. A tool-testing agent that rewrote MCP tool descriptions after testing them dozens of times "resulted in a 40% decrease in task completion time for future agents."
6. Start wide, then narrow down. Search strategy should mirror expert human research — explore the landscape before drilling into specifics. Agents tend to default to overly long, specific queries that return few results.
7. Guide the thinking process. Extended thinking mode serves as a controllable scratchpad. The lead agent uses it for planning; subagents use interleaved thinking after tool results to evaluate quality and identify gaps.
8. Parallel tool calling transforms speed and performance. Two kinds of parallelization: the lead agent spins up 3-5 subagents in parallel, and subagents use 3+ tools in parallel. "These changes cut research time by up to 90% for complex queries."
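The scaling rules in principle 3 can be written down as a small lookup table. The article embeds these rules in prompts rather than in code, so the sketch below is only one way to express them: `describe` renders a tier as a line of prompt guidance, and `clamp_agents` is an optional deterministic guard. The names and the `HARD_CAP` value are assumptions; the source gives no upper bound or per-agent call range for complex research.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EffortTier:
    label: str
    min_agents: int
    max_agents: Optional[int]                    # None: the text gives no upper bound
    calls_per_agent: Optional[tuple[int, int]]   # None: the text gives no range


# Numbers taken from principle 3.
TIERS = {
    "simple": EffortTier("simple fact-finding", 1, 1, (3, 10)),
    "comparison": EffortTier("direct comparison", 2, 4, (10, 15)),
    "complex": EffortTier("complex research", 10, None, None),
}

HARD_CAP = 20  # assumption: the article states no absolute ceiling


def describe(tier: EffortTier) -> str:
    """Render one tier as a line of prompt guidance."""
    if tier.max_agents is None:
        agents = f"{tier.min_agents}+"
    elif tier.max_agents == tier.min_agents:
        agents = str(tier.min_agents)
    else:
        agents = f"{tier.min_agents}-{tier.max_agents}"
    line = f"{tier.label}: {agents} agent(s)"
    if tier.calls_per_agent is not None:
        low, high = tier.calls_per_agent
        line += f", {low}-{high} tool calls each"
    return line


def clamp_agents(tier: EffortTier, requested: int) -> int:
    """Keep a requested subagent count inside the tier's bounds."""
    upper = HARD_CAP if tier.max_agents is None else tier.max_agents
    return max(tier.min_agents, min(requested, upper))
```

With this table, a request for 50 subagents on a simple query is clamped to 1, which is the failure mode the early agents exhibited.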
The overall prompting strategy focuses on instilling good heuristics rather than rigid rules, encoding strategies observed from skilled human researchers — decomposing questions, evaluating source quality, adjusting approaches based on new information, and knowing when to pursue depth vs. breadth.
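Principle 2 names four things every subagent brief needs. One way to make that checklist hard to skip is a small data structure that refuses to render an underspecified brief. This is a hypothetical sketch: `SubagentBrief` and its field names are illustrative and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class SubagentBrief:
    objective: str       # what the subagent should find out
    output_format: str   # how findings should be returned
    tool_guidance: str   # which tools and sources to prefer
    boundaries: str      # what is out of scope, to avoid duplicated work

    def missing_fields(self) -> list[str]:
        """Names of fields that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def to_prompt(self) -> str:
        """Render the brief as task text, or fail if anything is blank."""
        missing = self.missing_fields()
        if missing:
            raise ValueError(f"brief is underspecified, missing: {missing}")
        return (
            f"Objective: {self.objective}\n"
            f"Output format: {self.output_format}\n"
            f"Tools and sources: {self.tool_guidance}\n"
            f"Boundaries: {self.boundaries}"
        )
```

A brief containing only "research the semiconductor shortage" reports three missing fields, mirroring the vague instructions that caused misinterpretation and duplicated work.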
Effective Evaluation of Agents
Multi-agent systems don't follow the same steps each time, so traditional evaluations, which assume a fixed sequence of steps, don't apply. The team needed flexible methods that judge whether agents achieved correct outcomes through reasonable processes.
Start evaluating immediately with small samples. Early changes tend to have dramatic impacts. The team started with about 20 queries representing real usage patterns. Even a few test cases clearly showed the impact of changes. Teams shouldn't delay building evals waiting for large test sets.
LLM-as-judge evaluation scales when done well. Research outputs are free-form text that rarely has a single correct answer. The LLM judge evaluated each output against rubric criteria: factual accuracy, citation accuracy, completeness, source quality, and tool efficiency. A single LLM call with a single prompt, outputting scores from 0.0 to 1.0 and a pass-fail grade, proved the most consistent and the best aligned with human judgment.
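The single-call judge described above might look roughly like the following. The rubric names come from the text; the prompt wording, the JSON reply shape, and the `Verdict` type are assumptions made for illustration, and the actual model call is left out.

```python
import json
from dataclasses import dataclass

# Rubric criteria listed in the article.
RUBRIC = (
    "factual_accuracy",
    "citation_accuracy",
    "completeness",
    "source_quality",
    "tool_efficiency",
)


def build_judge_prompt(query: str, answer: str) -> str:
    """Build one prompt that asks for every score in a single call."""
    criteria = ", ".join(RUBRIC)
    return (
        "Grade the research answer below.\n"
        f"Score each criterion from 0.0 to 1.0: {criteria}.\n"
        'Reply with JSON only: {"scores": {...}, "pass": true or false}.\n\n'
        f"Query: {query}\n\nAnswer: {answer}"
    )


@dataclass
class Verdict:
    scores: dict[str, float]
    passed: bool


def parse_verdict(raw: str) -> Verdict:
    """Validate the judge's reply: all criteria present, scores in range."""
    data = json.loads(raw)
    scores = {name: float(data["scores"][name]) for name in RUBRIC}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} out of range: {value}")
    return Verdict(scores=scores, passed=bool(data["pass"]))
```

Validating the reply strictly means a malformed judge output fails loudly instead of silently skewing the eval.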
Human evaluation catches what automation misses. Manual testers found edge cases evals missed — hallucinated answers on unusual queries, system failures, and source selection biases. For instance, early agents "consistently chose SEO-optimized content farms over authoritative but less highly-ranked sources."
Multi-agent systems have emergent behaviors — small changes to the lead agent can unpredictably affect subagent behavior. The best prompts function as "frameworks for collaboration that define the division of labor, problem-solving approaches, and effort budgets."
Production Reliability and Engineering Challenges
Agents are stateful and errors compound. Long-running agents maintain state across many tool calls. Restarting from the beginning is expensive and frustrating, so the team built resumption systems. They let agents know when tools are failing so agents can adapt, combined with deterministic safeguards like retry logic and regular checkpoints.
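The combination of retry logic, checkpoints, and telling the agent about failures can be sketched as below. This is a minimal illustration, not the team's resumption system: the JSON checkpoint file, the `TOOL_FAILED` message format, and all function names are assumptions.

```python
import json
import time
from pathlib import Path
from typing import Callable


def call_with_retry(tool: Callable[[], str], attempts: int = 3,
                    base_delay: float = 0.5) -> str:
    """Retry with exponential backoff; report a final failure as text."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return tool()
        except Exception as exc:
            if attempt == attempts - 1:
                # Hand the failure to the agent so it can adapt its plan.
                return f"TOOL_FAILED: {exc}"
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def save_checkpoint(path: Path, state: dict) -> None:
    # Write to a temporary file first so a crash never leaves a half-written file.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def load_checkpoint(path: Path) -> dict:
    return json.loads(path.read_text()) if path.exists() else {"results": {}}


def run_steps(steps: dict[str, Callable[[], str]], path: Path) -> dict:
    """Run named steps in order, skipping any already in the checkpoint."""
    state = load_checkpoint(path)
    for name, tool in steps.items():
        if name in state["results"]:
            continue  # resume: this step finished before the interruption
        state["results"][name] = call_with_retry(tool)
        save_checkpoint(path, state)
    return state
```

Rerunning `run_steps` with the same checkpoint path after a crash picks up at the first unfinished step instead of restarting from the beginning.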
Debugging benefits from new approaches. Agents are non-deterministic between runs. When users reported agents "not finding obvious information," full production tracing was needed to diagnose root causes — bad queries, poor source choices, or tool failures. They monitor agent decision patterns and interaction structures while maintaining user privacy by not monitoring conversation contents.
Deployment needs careful coordination. Agent systems are highly stateful webs of prompts, tools, and execution logic that run almost continuously, so at any moment some agents are mid-task. Updating every agent at once would break those in-flight runs. Rainbow deployments avoid this by gradually shifting traffic from old to new versions while keeping both running.
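A toy router illustrates the idea of shifting traffic gradually while both versions stay up. The article does not describe the mechanism, so the details here are assumptions: sessions are pinned to the version they started on, new sessions are assigned by weight, and `RainbowRouter` and its methods are hypothetical names.

```python
import random
from typing import Optional


class RainbowRouter:
    """Weighted routing for new sessions, sticky routing for running ones."""

    def __init__(self, weights: dict[str, float], seed: Optional[int] = None):
        self.weights = dict(weights)
        self.sessions: dict[str, str] = {}   # session id -> pinned version
        self._rng = random.Random(seed)

    def shift(self, weights: dict[str, float]) -> None:
        """Change the traffic split; running sessions are not touched."""
        self.weights = dict(weights)

    def route(self, session_id: str) -> str:
        if session_id not in self.sessions:
            versions = list(self.weights)
            chosen = self._rng.choices(
                versions, weights=[self.weights[v] for v in versions]
            )[0]
            self.sessions[session_id] = chosen
        return self.sessions[session_id]

    def finish(self, session_id: str) -> None:
        self.sessions.pop(session_id, None)

    def drained(self, version: str) -> bool:
        """True once a version gets no new traffic and has no live sessions."""
        no_traffic = self.weights.get(version, 0) == 0
        return no_traffic and version not in self.sessions.values()
```

An old version is safe to retire only when `drained` returns true, which is what keeps a long-running agent from being cut off mid-task.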
Synchronous execution creates bottlenecks. Currently, lead agents execute subagents synchronously, waiting for each set of subagents to complete before proceeding. This simplifies coordination, but it means the lead agent cannot steer subagents mid-run and subagents cannot coordinate with each other. Asynchronous execution would enable more parallelism but adds challenges in result coordination, state consistency, and error propagation.
Conclusion
The authors emphasize that "the last mile often becomes most of the journey" with agent systems. The compound nature of errors means minor issues can derail agents entirely, and "the gap between prototype and production is often wider than anticipated."
Despite challenges, users report finding business opportunities, navigating healthcare options, resolving technical bugs, and saving days of work. Success requires careful engineering, comprehensive testing, detail-oriented prompt and tool design, robust operational practices, and tight cross-team collaboration.
A Clio embedding plot shows the top use case categories: developing software systems across specialized domains (10%), professional and technical content (8%), business growth strategies (8%), academic research (7%), and verifying information about people/places/organizations (5%).
Appendix: Additional Tips
End-state evaluation for state-mutating agents. Focus on whether agents achieved the correct final state rather than judging each step. Break evaluation into discrete checkpoints where specific state changes should have occurred.
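Checkpoint-based evaluation can be expressed as a set of predicates over state snapshots, ignoring the path the agent took between them. The sketch below assumes the harness records a named snapshot of the environment at each checkpoint; the function name and the ticket example are hypothetical.

```python
from typing import Callable

Check = Callable[[dict], bool]


def evaluate_checkpoints(snapshots: dict[str, dict],
                         expectations: dict[str, Check]) -> dict[str, bool]:
    """Return pass/fail per checkpoint; a missing snapshot counts as a fail."""
    return {
        name: name in snapshots and bool(check(snapshots[name]))
        for name, check in expectations.items()
    }


# Hypothetical example: an agent that should open and then close a ticket.
example_snapshots = {
    "after_create": {"tickets": {"T1": "open"}},
    "final": {"tickets": {"T1": "closed"}},
}
example_expectations = {
    "after_create": lambda s: "T1" in s["tickets"],
    "final": lambda s: s["tickets"]["T1"] == "closed",
}
```

Because only the state at each checkpoint is inspected, two runs that reach the same states through different tool calls receive the same grade.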
Long-horizon conversation management. Conversations spanning hundreds of turns require intelligent compression and memory. Agents summarize completed work phases and store essential information in external memory before new tasks. When context limits approach, fresh subagents with clean contexts can be spawned while maintaining continuity through handoffs.
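The summarize-then-hand-off pattern can be sketched as follows. It is a simplified illustration with loud assumptions: token counts are approximated by whitespace-separated words, `summarize` stands in for a model call, and a plain list plays the role of external memory.

```python
from dataclasses import dataclass, field
from typing import Callable


def rough_tokens(text: str) -> int:
    # Crude proxy for a tokenizer: count whitespace-separated words.
    return len(text.split())


@dataclass
class Conversation:
    limit: int                                   # context budget in rough tokens
    summarize: Callable[[list[str]], str]        # stand-in for a model call
    turns: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)  # external memory stand-in

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        if sum(rough_tokens(t) for t in self.turns) > self.limit:
            self.handoff()

    def handoff(self) -> None:
        """Store a summary externally and continue from a clean context."""
        summary = self.summarize(self.turns)
        self.memory.append(summary)
        self.turns = [f"Summary of earlier work: {summary}"]
```

After a handoff, the fresh context holds only the summary, while the external memory keeps every summary produced so far for later retrieval.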
Subagent output to a filesystem. Direct subagent outputs can bypass the main coordinator via artifact systems where agents store work in external systems and pass lightweight references back. This prevents information loss during multi-stage processing, reduces token overhead, and works especially well for structured outputs like code, reports, or data visualizations.
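The artifact pattern amounts to writing the full output somewhere durable and returning only a small reference. A minimal sketch, assuming a local directory as the artifact store; `ArtifactRef`, `store_artifact`, and `load_artifact` are hypothetical names, and the checksum is an added safeguard that the article does not mention.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ArtifactRef:
    """Lightweight reference passed back to the lead agent."""
    path: str
    summary: str
    sha256: str


def store_artifact(root: Path, name: str, content: str, summary: str) -> ArtifactRef:
    """Write the full output to disk and return only a reference to it."""
    root.mkdir(parents=True, exist_ok=True)
    target = root / name
    target.write_text(content, encoding="utf-8")
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return ArtifactRef(path=str(target), summary=summary, sha256=digest)


def load_artifact(ref: ArtifactRef) -> str:
    """Read the artifact back, checking it has not changed since storage."""
    content = Path(ref.path).read_text(encoding="utf-8")
    if hashlib.sha256(content.encode("utf-8")).hexdigest() != ref.sha256:
        raise ValueError("artifact changed since it was stored")
    return content
```

The lead agent's context then carries only the path and a one-line summary, and the full report is read from disk only when it is actually needed.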
Acknowledgements
The work reflects collective efforts of several teams across Anthropic. Special thanks to the Anthropic apps engineering team and early users for feedback.