Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet
Published: January 6, 2025
Author: Erik Schluntz (optimized the SWE-bench agent and wrote the blog post). Contributors include Simon Biggs, Dawn Drain, Eric Christiansen, Shauna Kravec, Felipe Rosso, Nova DasSarma, Ven Chandrasekaran, and others who helped with training.
SWE-bench is an AI evaluation benchmark that assesses a model's ability to complete real-world software engineering tasks.
The upgraded Claude 3.5 Sonnet achieved 49% on SWE-bench Verified, surpassing the previous state-of-the-art model's 45%. The post details the agent scaffold built around the model to help developers maximize performance.
What is SWE-bench?
SWE-bench tests how well a model can resolve GitHub issues from popular open-source Python repositories. For each task, the AI receives a set up Python environment and a repository checkout from just before the issue was resolved. The model must understand, modify, and test the code before submitting a proposed solution.
Each solution is graded against the real unit tests from the pull request that closed the original GitHub issue, testing whether the AI achieved the same functionality as the original human PR author.
SWE-bench evaluates an entire "agent" system — the combination of an AI model and the software scaffolding around it. This scaffolding generates prompts, parses model output to take action, and manages the interaction loop. Performance can vary significantly based on scaffolding even with the same underlying model.
Reasons for SWE-bench's popularity include:
- Real engineering tasks from actual projects rather than competition- or interview-style questions
- Not yet saturated — no model has crossed 50% on SWE-bench Verified
- Measures an entire agent rather than a model in isolation, allowing open-source developers and startups to optimize scaffoldings
The original SWE-bench dataset contains some unsolvable tasks lacking necessary context. SWE-bench Verified is a 500-problem subset reviewed by humans for solvability, providing the clearest measure of coding agent performance.
Achieving State-of-the-Art
Tool Using Agent
The design philosophy was to give "as much control as possible to the language model itself" and keep the scaffolding minimal. The agent has a prompt, a Bash Tool for executing bash commands, and an Edit Tool for viewing and editing files and directories. Sampling continues until the model decides it is finished or exceeds the 200k context length. This approach lets the model use its own judgment rather than being hardcoded into a particular pattern or workflow.
The prompt outlines a suggested approach but isn't overly long or detailed. The model freely chooses how to move between steps. If not token-sensitive, explicitly encouraging the model to produce a long response can help.
The agent prompt:
<uploaded_files>
{location}
</uploaded_files>
I've uploaded a python code repository in the directory {location} (not in /tmp/inputs). Consider the following PR description:
<pr_description>
{pr_description}
</pr_description>
Can you help me implement the necessary changes to the repository so that the requirements specified in the <pr_description> are met?
I've already taken care of all changes to any of the test files described in the <pr_description>. This means you DON'T have to modify the testing logic or any of the tests in any way!
Your task is to make the minimal changes to non-tests files in the {location} directory to ensure the <pr_description> is satisfied.
Follow these steps to resolve the issue:
1. As a first step, it might be a good idea to explore the repo to familiarize yourself with its structure.
2. Create a script to reproduce the error and execute it with `python <filename.py>` using the BashTool, to confirm the error
3. Edit the sourcecode of the repo to resolve the issue
4. Rerun your reproduce script and confirm that the error is fixed!
5. Think about edgecases and make sure your fix handles them as well
Your thinking should be thorough and so it's fine if it's very long.Bash Tool spec:
{
"name": "bash",
"description": "Run commands in a bash shell\n* When invoking this tool, the contents of the \"command\" parameter does NOT need to be XML-escaped.\n* You don't have access to the internet via this tool.\n* You do have access to a mirror of common linux and python packages via apt and pip.\n* State is persistent across command calls and discussions with the user.\n* To inspect a particular line range of a file, e.g. lines 10-25, try 'sed -n 10,25p /path/to/the/file'.\n* Please avoid commands that may produce a very large amount of output.\n* Please run long lived commands in the background, e.g. 'sleep 10 &' or start a server in the background.",
"input_schema": {
"type": "object",
"properties": {
"command": {
"type": "string",
"description": "The bash command to run."
}
},
"required": ["command"]
}
}Edit Tool description:
{
"name": "str_replace_editor",
"description": "Custom editing tool for viewing, creating and editing files\n* State is persistent across command calls and discussions with the user\n* If `path` is a file, `view` displays the result of applying `cat -n`. If `path` is a directory, `view` lists non-hidden files and directories up to 2 levels deep\n* The `create` command cannot be used if the specified `path` already exists as a file\n* If a `command` generates a long output, it will be truncated and marked with `<response clipped>` \n* The `undo_edit` command will revert the last edit made to the file at `path`\n\nNotes for using the `str_replace` command:\n* The `old_str` parameter should match EXACTLY one or more consecutive lines from the original file. Be mindful of whitespaces!\n* If the `old_str` parameter is not unique in the file, the replacement will not be performed. Make sure to include enough context in `old_str` to make it unique\n* The `new_str` parameter should contain the edited lines that should replace the `old_str`",
"input_schema": {
"type": "object",
"properties": {
"command": {
"type": "string",
"enum": ["view", "create", "str_replace", "insert", "undo_edit"],
"description": "The commands to run. Allowed options are: `view`, `create`, `str_replace`, `insert`, `undo_edit`."
},
"file_text": {
"description": "Required parameter of `create` command, with the content of the file to be created.",
"type": "string"
},
"insert_line": {
"description": "Required parameter of `insert` command. The `new_str` will be inserted AFTER the line `insert_line` of `path`.",
"type": "integer"
},
"new_str": {
"description": "Required parameter of `str_replace` command containing the new string. Required parameter of `insert` command containing the string to insert.",
"type": "string"
},
"old_str": {
"description": "Required parameter of `str_replace` command containing the string in `path` to replace.",
"type": "string"
},
"path": {
"description": "Absolute path to file or directory, e.g. `/repo/file.py` or `/repo`.",
"type": "string"
},
"view_range": {
"description": "Optional parameter of `view` command when `path` points to a file. If none is given, the full file is shown. If provided, the file will be shown in the indicated line number range, e.g. [11, 12] will show lines 11 and 12. Indexing at 1 to start. Setting `[start_line, -1]` shows all lines from `start_line` to the end of the file.",
"items": { "type": "integer" },
"type": "array"
}
},
"required": ["command", "path"]
}
}One performance improvement came from "error-proofing" tools. For instance, since models sometimes mishandled relative file paths after moving out of the root directory, the tool was made to always require an absolute path.
Several strategies for specifying edits were tested, with string replacement achieving the highest reliability. The model specifies old_str to replace with new_str, and replacement only occurs if there is exactly one match. If there are more or fewer matches, an error message is shown for retry.
Results
| Model | SWE-bench Verified score |
|---|---|
| Claude 3.5 Sonnet (new) | 49% |
| Previous SOTA | 45% |
| Claude 3.5 Sonnet (old) | 33% |
| Claude 3 Opus | 22% |
All scores use the same agent scaffold. The upgraded Claude 3.5 Sonnet demonstrates higher reasoning, coding, and mathematical abilities along with improved agentic capabilities compared to prior models and the previous state-of-the-art.
Examples of Agent Behavior
The benchmark used the SWE-Agent framework as a foundation. Agent text output, tool calls, and tool responses are rendered as THOUGHT, ACTION, and OBSERVATION, though the model isn't constrained to a fixed ordering.
Walkthrough: RidgeClassifierCV Issue
Initial prompt (with {pr_description} filled in from a real SWE-bench task about a store_cv_values parameter error in scikit-learn's RidgeClassifierCV):
The task described a TypeError when passing store_cv_values=True to RidgeClassifierCV, even though some attributes depend on it.
Step 1 — Explore repo structure: The model used the Edit Tool's view command on /repo, discovering the repository layout including sklearn/linear_model, sklearn/base.py, and other directories.
Step 2 — Create reproduction script: The model created /repo/reproduce_error.py with the reproduction code from the PR description.
Step 3 — Run reproduction script: The Bash Tool executed the script, confirming the TypeError: __init__() got an unexpected keyword argument 'store_cv_values'.
Step 4 — Fix the source code: The model identified that RidgeClassifierCV inherited from _BaseRidgeCV but failed to pass store_cv_values from its constructor to the base class. Using the Edit Tool's str_replace command on /repo/sklearn/linear_model/ridge.py, it modified the __init__ method to accept store_cv_values=False as a parameter and pass it through to the super().__init__() call.
This particular example completed in 12 steps. Some tasks required more than 100 turns before submission; in others the model kept trying until context was exhausted.
Compared to older models, the updated 3.5 Sonnet "self-corrects more often" and "shows an ability to try several different solutions" rather than repeating the same mistakes.
Challenges
SWE-bench Verified is powerful but more complex to run than simple single-turn evaluations:
Duration and high token costs. Many successful runs took hundreds of turns and over 100k tokens. The updated Claude 3.5 Sonnet is tenacious and can often find its way around a problem given enough time, but that can be expensive.
Grading. Inspecting failed tasks revealed cases where the model behaved correctly but environment setup issues or problems with install patches being applied twice caused failures. Resolving these systems issues is crucial for accurate performance assessment.
Hidden tests. Since the model cannot see the grading tests, it often believes it has succeeded when the task actually fails. Some failures stem from solving at the wrong level of abstraction. Others occur when the model solves the problem but doesn't match the original unit tests.
Multimodal. Despite the updated Claude 3.5 Sonnet having excellent vision capabilities, no implementation allowed it to view files saved to the filesystem or referenced as URLs. This made debugging certain tasks (especially Matplotlib-related ones) difficult and prone to hallucinations. SWE-bench has launched a multimodal evaluation, and the team looks forward to developers achieving higher scores with Claude.
Summary
The upgraded Claude 3.5 Sonnet achieved 49% on SWE-bench Verified, beating the previous state-of-the-art of 45%, using a simple prompt and two general-purpose tools (Bash and Edit). The team expressed confidence that developers building with the new model will quickly find better ways to improve SWE-bench scores.
Acknowledgements
Erik Schluntz optimized the SWE-bench agent and wrote the blog post. Simon Biggs, Dawn Drain, and Eric Christiansen helped implement the benchmark. Shauna Kravec, Dawn Drain, Felipe Rosso, Nova DasSarma, Ven Chandrasekaran, and many others contributed to training Claude 3.5 Sonnet for agentic coding.