Quantifying infrastructure noise in agentic coding evals
Published: Feb 05, 2026
Written by: Gian Segato
The article examines how infrastructure configuration alone can shift agentic coding benchmark results by several percentage points—sometimes exceeding the gap between top-performing models on leaderboards.
Core Finding
Agentic coding evals like SWE-bench and Terminal-Bench differ from static benchmarks because "the runtime is no longer a passive container, but an integral component of the problem-solving process." Two agents operating under different resource budgets and time limits are effectively taking different tests.
In internal experiments, Terminal-Bench 2.0 showed a 6 percentage point spread (p < 0.01) between the most- and least-resourced configurations.
How the Investigation Began
Anthropic runs Terminal-Bench 2.0 on Google Kubernetes Engine. During calibration, scores diverged from the official leaderboard and infra error rates were unexpectedly high—up to 6% of tasks failing due to pod errors unrelated to model capability.
The root cause was how resource limits were enforced. Container runtimes enforce resources via two parameters: a guaranteed allocation and a hard kill threshold. Anthropic's Kubernetes setup set both to the per-task resource spec, so the spec acted as guaranteed allocation and hard ceiling at once. With the two values equal, there is zero headroom for transient spikes, which caused premature OOM kills.
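The two-parameter model can be sketched as a small helper that builds a Kubernetes-style `resources` stanza. This is a minimal illustration, not the article's configuration: the function name `container_resources`, the `headroom` parameter, and the example sizes are hypothetical, and mapping "guaranteed allocation" to `requests` and "hard kill threshold" to `limits` is an assumption based on standard Kubernetes terminology.

```python
from typing import Optional


def container_resources(cpu_cores: float, memory_mib: int,
                        headroom: Optional[float] = 1.0) -> dict:
    """Build a Kubernetes-style `resources` stanza for one task container.

    `requests` is the guaranteed allocation; `limits` is the hard ceiling.
    `headroom` (hypothetical parameter) multiplies the per-task spec to get
    the ceiling. headroom=1.0 reproduces the strict setup where both values
    coincide; headroom=None omits limits entirely (an uncapped container).
    """
    requests = {
        "cpu": f"{round(cpu_cores * 1000)}m",   # millicores
        "memory": f"{memory_mib}Mi",
    }
    if headroom is None:
        return {"requests": requests}
    limits = {
        "cpu": f"{round(cpu_cores * 1000 * headroom)}m",
        "memory": f"{round(memory_mib * headroom)}Mi",
    }
    return {"requests": requests, "limits": limits}


# Illustrative task spec: 1 CPU core and 2048 MiB of memory.
strict = container_resources(1, 2048, headroom=1.0)   # requests == limits
roomy = container_resources(1, 2048, headroom=3.0)    # 3x ceiling
uncapped = container_resources(1, 2048, headroom=None)
```

In the strict case a transient memory spike above the request immediately hits the limit, whereas the 3x variant leaves room between the two values.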
Terminal-Bench's leaderboard uses a different sandboxing provider with more lenient enforcement, allowing temporary overallocation.
The Experiment
They ran Terminal-Bench 2.0 across six resource configurations, from strict 1x enforcement to completely uncapped, while holding the Claude model, harness, and task set constant.
Key results:
- Infrastructure error rates dropped monotonically: from 5.8% at 1x to 0.5% when uncapped
- The drop from strict enforcement to 3x headroom was significant at p < 0.001
- From 1x through 3x, success scores fluctuated within noise margins (p = 0.40)
- Between 3x and uncapped, success jumped ~4 percentage points while infra errors dropped only 1.6 points
- Total lift at uncapped over 1x was +6 percentage points (p < 0.01)
The article explains that up to ~3x specs, additional resources fix reliability issues. Above 3x, resources actively help agents solve problems they couldn't before, meaning "limits can actually change what the eval measures."
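The summary does not say which statistical test produced the p-values above. As one plausible way to compare two rates, here is a sketch of a pooled two-proportion z-test using only the standard library. The function name and the trial counts in the example are hypothetical; only the 5.8% and 0.5% error rates come from the text.

```python
import math


def two_proportion_z_test(events_a: int, n_a: int,
                          events_b: int, n_b: int) -> tuple:
    """Two-sided pooled z-test for a difference between two proportions.

    Returns (z, p_value), where z is positive when group B's rate is higher.
    Uses the normal approximation, so it assumes reasonably large samples.
    """
    rate_a = events_a / n_a
    rate_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical denominators (1000 runs per configuration), chosen only so
# that the rates match the 5.8% and 0.5% infra error rates in the text.
z, p = two_proportion_z_test(events_a=58, n_a=1000, events_b=5, n_b=1000)
```

With real data, the per-configuration run counts would replace the placeholder denominators.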
Impact on Measurement
Different resource configurations reward different strategies. Tight limits favor efficient, lean approaches; generous limits reward agents that exploit all available resources. The article uses a Bayesian network fitting task (bn-fit-modify) as an example: some models install the full Python data science stack, which succeeds under generous limits but crashes under tight ones during package installation.
The core finding was replicated across different Anthropic models with consistent direction but varying magnitude. A crossover experiment on SWE-bench (227 problems, 10 samples each) showed the same monotonic increase with RAM, though with a smaller 1.54 percentage point spread at 5x versus 1x—expected since SWE-bench tasks are less resource-intensive.
Other Sources of Variance
Resource allocation isn't the only hidden variable. Time limits also play a role in certain configurations. The authors note that "every element of the evaluation setup can influence the final score," including cluster health, hardware specs, concurrency level, and even egress bandwidth. They observed anecdotally that pass rates fluctuate with time of day, likely due to API latency variations.
The article notes that "the boundary between 'model capability' and 'infrastructure behavior' is blurrier than a single benchmark score suggests."
Recommendations
The authors recommend that evals specify both the guaranteed allocation and hard kill threshold per task, rather than a single pinned value. The band between the two should be calibrated so scores at the floor and ceiling fall within noise of each other. For Terminal-Bench 2.0, a 3x ceiling cut infra error rates by roughly two-thirds while keeping the score lift within noise—described as a reasonable tradeoff that neutralizes the infrastructure confounder without removing meaningful resource pressure.
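The calibration rule can be sketched as a search for the largest ceiling whose score still sits within noise of the score at the floor. This is an assumption about how one might operationalize the recommendation, not the authors' procedure: the function name `pick_ceiling`, the `noise_margin` parameter, and all numbers in the example are made up for illustration.

```python
from typing import Dict


def pick_ceiling(score_by_headroom: Dict[float, float],
                 noise_margin: float) -> float:
    """Return the largest headroom multiplier whose score stays within
    `noise_margin` of the score at the smallest (floor) multiplier.

    Multipliers are scanned in increasing order and the scan stops at the
    first one whose score leaves the noise band, so the chosen ceiling is
    always contiguous with the floor.
    """
    multipliers = sorted(score_by_headroom)
    floor_score = score_by_headroom[multipliers[0]]
    chosen = multipliers[0]
    for multiplier in multipliers:
        if abs(score_by_headroom[multiplier] - floor_score) <= noise_margin:
            chosen = multiplier
        else:
            break
    return chosen


# Made-up pass rates per headroom multiplier, NOT the article's measurements.
example_scores = {1.0: 0.50, 2.0: 0.51, 3.0: 0.515, 5.0: 0.55}
ceiling = pick_ceiling(example_scores, noise_margin=0.02)
```

The noise margin itself would have to come from repeated runs at a fixed configuration.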
Key Takeaway
The article concludes that "leaderboard differences below 3 percentage points deserve skepticism until the eval configuration is documented and matched." The observed spread across moderate resource configurations in Terminal-Bench is just below 2 percentage points, and infrastructure confounders stack on top of naive binomial confidence intervals rather than within them. A few-point lead on a leaderboard might reflect genuine capability—or might just reflect beefier hardware.
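For reference, the "naive binomial confidence interval" mentioned above can be computed with the normal (Wald) approximation. The sketch below uses a hypothetical function name and example values; it captures sampling noise only, so any infrastructure-induced shift would come on top of the half-width it returns.

```python
import math


def binomial_ci_halfwidth(pass_rate: float, n_trials: int,
                          z: float = 1.96) -> float:
    """Half-width of a Wald confidence interval for a pass rate.

    z = 1.96 corresponds to a 95% interval under the normal approximation.
    This reflects sampling noise only; it says nothing about differences
    caused by resource limits, time limits, or cluster health.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_trials)


# Hypothetical eval: 50% pass rate measured over 1000 independent trials.
halfwidth = binomial_ci_halfwidth(0.5, 1000)
```

Because the half-width shrinks with the square root of the trial count, quadrupling the number of trials only halves the interval.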
Acknowledgements: Special thanks to Nicholas Carlini, Jeremy Hadfield, Mike Merrill, and Alex Shaw.