Effective Context Engineering for AI Agents
Published: September 29, 2025 Authors: Anthropic's Applied AI team: Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield, with contributions from Rafi Ayub, Hannah Moran, Cal Rueb, and Connor Jennings. Special thanks to Molly Vorwerck, Stuart Ritchie, and Maggie Vo.
Summary & Key Content
The article presents context engineering as the evolution beyond prompt engineering for building AI agents. Context engineering is defined as optimizing the utility of tokens against LLM constraints to consistently achieve desired outcomes. The authors emphasize "thinking in context" — considering the holistic state available to the LLM at any given time.
Context Engineering vs. Prompt Engineering
Prompt engineering focuses on writing and organizing instructions, primarily system prompts. Context engineering is broader, encompassing strategies for curating and maintaining the optimal set of tokens during inference — including system instructions, tools, MCP, external data, message history, and more. The authors describe it as "the art and science" of curating what enters the limited context window from a constantly evolving universe of possible information.
Why Context Engineering Matters
LLMs experience context rot — as token count increases, the model's ability to accurately recall information decreases. The article draws a parallel to human "limited working memory capacity," describing an "attention budget" for LLMs. This stems from transformer architecture constraints where every token attends to every other token, creating n² pairwise relationships. Models also have less training experience with longer sequences. Position encoding interpolation helps handle longer sequences but with some degradation.
The guiding principle: find the smallest possible set of high-signal tokens that maximize the likelihood of a desired outcome.
Anatomy of Effective Context
System prompts should use simple, direct language at the "right altitude" — a Goldilocks zone between two failure modes:
- Hardcoding complex, brittle logic that creates fragility
- Providing vague, high-level guidance that fails to give concrete signals
Recommendations include organizing prompts into distinct sections using XML tags or Markdown headers, and striving for the minimal set of information that fully outlines expected behavior. Start with a minimal prompt using the best model, then add instructions based on observed failure modes.
Tools should promote efficiency by returning token-efficient information and encouraging efficient agent behaviors. Tools should be self-contained, robust to error, and extremely clear in intended use. A common failure mode is "bloated tool sets" with ambiguous decision points. The article states: "If a human engineer can't definitively say which tool should be used in a given situation, an AI agent can't be expected to do better."
Examples (few-shot prompting) is strongly advised, but the article warns against stuffing prompts with edge cases. Instead, teams should curate "diverse, canonical examples" that effectively portray expected behavior. For LLMs, examples are described as "the 'pictures' worth a thousand words."
Context Retrieval and Agentic Search
The article references a simple definition of agents: "LLMs autonomously using tools in a loop."
The field is shifting toward "just in time" context strategies — rather than pre-processing all relevant data upfront, agents maintain lightweight identifiers (file paths, stored queries, web links) and dynamically load data at runtime using tools. Claude Code uses this approach for complex data analysis over large databases, where the model writes targeted queries, stores results, and uses Bash commands to analyze data without loading full data objects into context.
This mirrors human cognition — we use "external organization and indexing systems like file systems, inboxes, and bookmarks" to retrieve relevant information on demand.
Progressive disclosure allows agents to incrementally discover relevant context through exploration. Metadata from references provides signals — folder hierarchies, naming conventions, timestamps help both humans and agents understand how and when to utilize information.
The trade-off: runtime exploration is slower than pre-computed retrieval, and proper engineering is needed to ensure agents have the right tools and heuristics.
A hybrid strategy may be most effective — retrieving some data upfront for speed while pursuing further autonomous exploration. Claude Code employs this model with CLAUDE.md files dropped into context upfront, while primitives like glob and grep allow just-in-time navigation.
The article notes that "do the simplest thing that works" will likely remain the best advice for teams building agents.
Context Engineering for Long-Horizon Tasks
For tasks spanning tens of minutes to multiple hours, three techniques are discussed:
Compaction involves summarizing a conversation nearing the context window limit and reinitiating with the summary. In Claude Code, message history is passed to the model to summarize and compress critical details. The agent preserves architectural decisions, unresolved bugs, and implementation details while discarding redundant outputs. The article cautions that "overly aggressive compaction can result in the loss of subtle but critical context." A recommended approach: start by maximizing recall, then iterate to improve precision. Tool result clearing is described as one of the safest, lightest-touch forms of compaction.
Structured note-taking (agentic memory) involves the agent regularly writing notes persisted outside the context window, pulled back in later. The article cites Claude playing Pokemon as an example, where the agent "maintains precise tallies across thousands of game steps" and develops maps, remembers achievements, and maintains strategic combat notes. After context resets, the agent reads its own notes and continues multi-hour sequences.
Sub-agent architectures use specialized sub-agents handling focused tasks with clean context windows. Each subagent may explore extensively but returns "only a condensed, distilled summary" (often 1,000-2,000 tokens). This achieves separation of concerns and showed "substantial improvement over single-agent systems on complex research tasks."
The choice depends on task characteristics:
- Compaction maintains conversational flow for tasks requiring extensive back-and-forth
- Note-taking excels for iterative development with clear milestones
- Multi-agent architectures handle complex research and analysis where parallel exploration pays dividends
Conclusion
Context engineering represents a fundamental shift — the challenge is "thoughtfully curating what information enters the model's limited attention budget at each step." The guiding principle: find the smallest set of high-signal tokens that maximize desired outcomes. As models improve, smarter models require less prescriptive engineering, but "treating context as a precious, finite resource will remain central to building reliable, effective agents."