Designing AI-Resistant Technical Evaluations

Published: Jan 21, 2026

Author: Tristan Hume, a lead on Anthropic's performance optimization team


Overview

This article describes how Anthropic's performance engineering team created and iteratively redesigned a take-home test for hiring performance engineers, as successive Claude models kept matching or exceeding human performance on the test.

Since early 2024, the team has used a take-home in which candidates optimize code for a simulated accelerator. Over 1,000 candidates have completed it, and dozens were hired, including engineers who "brought up our Trainium cluster and shipped every model since Claude 3 Opus."

Each new Claude model forced redesigns. "Claude Opus 4 outperformed most human applicants," and then "Claude Opus 4.5 matched even those." Given unlimited time, humans can still outperform models, but within the take-home's time constraints, distinguishing top candidates from the model became impossible.

The author iterated through three versions, learning "something new about what makes evaluations robust to AI assistance and what doesn't."

The original take-home is being released as an open challenge on GitHub.


The Origin of the Take-Home

In November 2023, Anthropic was preparing to train Claude 3 Opus, with new TPU and GPU clusters and a large Trainium cluster coming online, but it lacked sufficient performance engineers. A Twitter post brought in more candidates than the standard interview pipeline could handle.

The author spent two weeks designing a take-home to evaluate candidates more efficiently.

Design Goals

The author wanted something "genuinely engaging that would make candidates excited to participate." The format offers advantages for evaluating performance engineering skills:

  • Longer time horizon: A 4-hour window (later 2 hours) better reflects real work than a 50-minute interview
  • Realistic environment: No one watching; candidates work in their own editor
  • Time for comprehension and tooling: Performance optimization requires understanding existing systems and sometimes building debugging tools
  • Compatibility with AI assistance: Candidates are explicitly allowed to use AI tools, since longer-horizon problems are harder for AI to solve completely

Additional principles applied to the design:

  • Representative of real work
  • High signal: Avoiding problems that hinge on a single insight; ensuring wide scoring distribution with enough depth that strong candidates don't finish everything
  • No specific domain knowledge: Good fundamentals matter more than narrow expertise
  • Fun: Fast development loops, interesting problems, and room for creativity

The Simulated Machine

The author built a Python simulator for a fake accelerator with TPU-like characteristics. Candidates optimize code running on this machine, using a hot-reloading Perfetto trace showing every instruction.

The machine features: manually managed scratchpad memory, VLIW (multiple execution units running in parallel each cycle), SIMD (vector operations), and multicore (distributing work across cores).
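
To make the VLIW idea concrete, here is a minimal sketch of how packing independent operations into the same cycle shortens a program. It is not the take-home's simulator: the engine names, the per-cycle slot limits in `SLOTS`, the op format, and the greedy in-order scheduler are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical per-cycle slot limits for each execution unit.
# The article does not give the real machine's limits.
SLOTS = {"alu": 2, "load": 1}

# Each op is (engine, destination register, source registers).
# Assumption: every register is written only once, so only
# read-after-write dependencies need to be tracked.
PROGRAM = [
    ("load", "a", []),
    ("load", "b", []),
    ("alu", "c", ["a"]),
    ("alu", "d", ["b"]),
    ("alu", "e", ["c", "d"]),
]


def serial_cycles(ops):
    """One instruction per cycle, no parallelism."""
    return len(ops)


def packed_cycles(ops, slots=SLOTS):
    """Greedily place each op in the earliest cycle where its inputs
    are available and its engine still has a free slot."""
    ready = {}               # register -> first cycle its value can be read
    used = defaultdict(int)  # (cycle, engine) -> slots already taken
    last = -1
    for engine, dest, srcs in ops:
        cycle = max((ready.get(s, 0) for s in srcs), default=0)
        while used[(cycle, engine)] >= slots[engine]:
            cycle += 1
        used[(cycle, engine)] += 1
        ready[dest] = cycle + 1  # result is visible from the next cycle
        last = max(last, cycle)
    return last + 1


print(serial_cycles(PROGRAM), packed_cycles(PROGRAM))
```

In this toy model the single load slot is the bottleneck: raising `SLOTS["load"]` to 2 lets both loads issue together and shortens the schedule further, which is the kind of trade-off a trace of every instruction makes visible.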

The task is a parallel tree traversal, deliberately not deep learning flavored. It was "inspired by branchless SIMD decision tree inference, a classical ML optimization challenge." Candidates start with a fully serial implementation and progressively exploit parallelism — first multicore, then choosing between SIMD vectorization or VLIW instruction packing. Version 1 also included a bug requiring debugging.
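
For readers unfamiliar with branchless decision tree inference, the sketch below shows the general technique on a complete binary tree stored in an array. It is not the take-home's kernel: the tree layout (children of node `i` at `2*i + 1` and `2*i + 2`), the comparison rule, and the function names are assumptions made for illustration.

```python
def traverse_branchy(thresholds, depth, x):
    """Serial traversal with a data-dependent branch at every level."""
    idx = 0
    for _ in range(depth):
        if x > thresholds[idx]:
            idx = 2 * idx + 2  # go right
        else:
            idx = 2 * idx + 1  # go left
    return idx


def traverse_lockstep(thresholds, depth, xs):
    """Branchless traversal of many inputs at once.

    The comparison result (False/True, i.e. 0/1) is added to the index,
    so every input executes the same instructions at each level. That
    uniformity is what lets the loop body map onto vector lanes.
    """
    idxs = [0] * len(xs)
    for _ in range(depth):
        idxs = [2 * i + 1 + (x > thresholds[i]) for i, x in zip(idxs, xs)]
    return idxs


# Depth-2 tree: root threshold 5, left child 2, right child 8.
THRESHOLDS = [5, 2, 8]
print(traverse_lockstep(THRESHOLDS, 2, [1, 3, 6, 9]))
```

The returned values are leaf positions in the same array numbering, so a depth-2 tree yields indices 3 through 6.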


Early Results

The initial take-home worked well. The standout from the Twitter batch "started in early February, two weeks after our first hires through the standard pipeline." The test proved predictive — this hire immediately began optimizing kernels and found a workaround for a launch-blocking compiler bug.

Over about 1.5 years, roughly 1,000 candidates completed the take-home. It proved "especially valuable for candidates with limited experience on paper: several of our highest-performing engineers came directly from undergrad."

Many candidates worked past the 4-hour limit because they enjoyed it. The strongest unlimited-time submissions included "full optimizing mini-compilers and several clever optimizations I hadn't anticipated."


Then Claude Opus 4 Defeated It

By May 2025, "Claude 3.7 Sonnet had already crept up to the point where over 50% of candidates would have been better off delegating to Claude Code entirely." When the author tested a pre-release Claude Opus 4, it produced a more optimized solution than almost all humans had within the 4-hour limit.

The author had previously designed a live interview question in 2023 to be more problem-solving focused than knowledge-based. "Claude 3 Opus beat part 1 of that question; Claude 3.5 Sonnet beat part 2."

The fix for the take-home was straightforward: the problem had far more depth than anyone could explore in 4 hours, so the author used Claude Opus 4 to identify where it started struggling and made that point the new starting point. Version 2 added cleaner starter code and new machine features, removed multicore, and shortened the time limit to 2 hours (reducing scheduling delays). It emphasized "clever optimization insights over debugging and code volume."


Then Claude Opus 4.5 Defeated That

The author tested a pre-release Claude Opus 4.5 checkpoint and watched Claude Code work on the problem for 2 hours. It solved the initial bottlenecks, implemented the common micro-optimizations, and passed the threshold in under an hour.

Then it stopped, "convinced it had hit an insurmountable memory bandwidth bottleneck." Most humans reach the same conclusion, but clever tricks exploiting the problem structure exist. When told the achievable cycle count, Claude found the trick, debugged, tuned, and implemented further optimizations. By the 2-hour mark, its score matched the best human performance — "and that human had made heavy use of Claude 4 with steering."

Testing in Anthropic's internal test-time compute harness confirmed that the model could both beat humans in 2 hours and continue improving when given more time.


Considering the Options

Some colleagues suggested banning AI assistance. The author didn't want this: "I had a sense that given people continue to play a vital role in our work, I should be able to figure out some way for them to distinguish themselves."

Others suggested raising the bar to substantially outperform Claude Code alone. The concern was that Claude works fast, and humans typically spend half the 2 hours reading and understanding the problem. "The dominant strategy might become sitting back and watching."

Performance engineers at Anthropic now focus on "tough debugging, systems design, performance analysis," and on making Claude's code simpler and more elegant. These skills are hard to test objectively without a lot of time or shared context.

The author also worried that either Claude would solve any new take-home too, or it would become too challenging for humans to complete in two hours.


Attempt 1: A Different Optimization Problem

The author chose a problem based on a tricky kernel optimization done at Anthropic: efficient data transposition on 2D TPU registers while avoiding bank conflicts. Claude implemented the changes to the take-home in under a day.

"Claude Opus 4.5 found a great optimization I hadn't even thought of." It realized it could transpose the entire computation rather than the data, rewriting the whole program. The author patched the problem to remove that approach. Claude then made progress but couldn't find the most efficient solution. When the author double-checked using Claude Code's "ultrathink" feature, which gives the model a longer thinking budget, Claude solved the problem, and it even knew tricks for fixing bank conflicts.

The author reflected that engineers across many platforms have struggled with data transposition and bank conflicts, so Claude has substantial training data. "While I'd found my solution from first principles, Claude could draw on a larger toolbox of experience."
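
As background on why bank conflicts are such well-trodden ground, the sketch below models a banked memory and counts how many simultaneous accesses land in the same bank when one column of a row-major matrix is read. Padding each row by one element is a widely used remedy on banked memories. The bank mapping (`address % num_banks`), the sizes, and the function names are assumptions for illustration; the article does not say which tricks the author or Claude actually used.

```python
from collections import Counter


def column_addresses(col, rows, row_stride):
    """Addresses of one column in a row-major matrix with the given stride."""
    return [r * row_stride + col for r in range(rows)]


def worst_bank_conflict(addresses, num_banks):
    """Largest number of simultaneous accesses that hit a single bank.

    Assumes address a lives in bank a % num_banks. A result of 1 means
    the accesses are conflict-free; a result of k means the slowest bank
    has to serve k requests.
    """
    counts = Counter(addr % num_banks for addr in addresses)
    return max(counts.values())


NUM_BANKS = 8
ROWS = 8

# Stride equal to the bank count: every element of a column shares a bank.
unpadded = worst_bank_conflict(column_addresses(3, ROWS, 8), NUM_BANKS)

# One element of padding per row shifts each row to a different bank.
padded = worst_bank_conflict(column_addresses(3, ROWS, 9), NUM_BANKS)

print(unpadded, padded)
```

The padded layout costs one unused element per row, a small price for turning an 8-way conflict into conflict-free column access in this model.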


Attempt 2: Going Weirder

The author needed a problem where human reasoning could win over Claude's larger experience base — something sufficiently out of distribution, which conflicted with the goal of resembling the job.

Inspired by Zachtronics games (programming puzzle games with unusual, highly constrained instruction sets), the author designed a new take-home with puzzles using a tiny, heavily constrained instruction set, optimizing for minimal instruction count. Testing one medium-hard puzzle on Claude Opus 4.5, "it failed." More puzzles were added and colleagues verified people could outperform Claude.

Unlike Zachtronics games, the take-home intentionally provides no visualization or debugging tools. "Building debugging tools is part of what's being tested."
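
To give a flavor of this style of puzzle, here is a minimal sketch of a two-instruction machine, scored by instruction count, with an optional trace as the simplest possible debugging tool. The instruction set (`inc`, `dbl`), the target value, and the scoring rule are entirely hypothetical; the article does not describe the real puzzles or their instruction set.

```python
def run(program, trace=False):
    """Execute a straight-line program on a single accumulator.

    Hypothetical instructions: "inc" adds 1, "dbl" doubles the value.
    With trace=True, the state after every instruction is recorded.
    """
    acc = 0
    history = []
    for op in program:
        if op == "inc":
            acc += 1
        elif op == "dbl":
            acc *= 2
        else:
            raise ValueError(f"unknown instruction: {op}")
        if trace:
            history.append((op, acc))
    return acc, history


def score(program):
    """Lower is better: the puzzle is to minimize instruction count."""
    return len(program)


# Two programs that both leave 10 in the accumulator.
naive = ["inc"] * 10
clever = ["inc", "dbl", "dbl", "inc", "dbl"]

for name, program in [("naive", naive), ("clever", clever)]:
    result, _ = run(program)
    print(name, result, score(program))

# The trace shows how the shorter program reaches its result.
_, steps = run(clever, trace=True)
print(steps)
```

Even in a toy like this, the trace is what makes it possible to see why a shorter program works, which hints at why tool-building can be part of what a puzzle measures.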

The author is "reasonably happy with the new take-home." Early results show scores correlating well with candidates' caliber of past work. However, the author is "still sad to have given up the realism and varied depth of the original." The insight: "The original worked because it resembled real work. The replacement works because it simulates novel work."


An Open Challenge

The original take-home is released for anyone to try with unlimited time. Human experts retain an advantage over current models at sufficiently long time horizons. "The fastest human solution ever submitted substantially exceeds what Claude has achieved even with extensive test-time compute."

Performance Benchmarks (clock cycles on the simulated machine)

Result        Description
2164 cycles   Claude Opus 4 after many hours in the test-time compute harness
1790 cycles   Claude Opus 4.5 in a casual Claude Code session, approx. matching best human performance in 2 hours
1579 cycles   Claude Opus 4.5 after 2 hours in the test-time compute harness
1548 cycles   Claude Sonnet 4.5 after many more than 2 hours of test-time compute
1487 cycles   Claude Opus 4.5 after 11.5 hours in the harness
1363 cycles   Claude Opus 4.5 in an improved test time compute harness after many hours
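
Because lower cycle counts are better, the gaps between these results are easier to compare as speedups. The short sketch below computes each result's speedup relative to the 2164-cycle Claude Opus 4 figure and checks a score against the 1487-cycle mark. The cycle counts are copied from the results above; the choice of baseline, the labels, and the helper names are assumptions made for illustration.

```python
# Cycle counts copied from the results above (lower is better).
RESULTS = {
    2164: "Claude Opus 4, many hours in harness",
    1790: "Claude Opus 4.5, casual Claude Code session",
    1579: "Claude Opus 4.5, 2 hours in harness",
    1548: "Claude Sonnet 4.5, many more than 2 hours",
    1487: "Claude Opus 4.5, 11.5 hours in harness",
    1363: "Claude Opus 4.5, improved harness, many hours",
}

BASELINE = 2164     # assumption: use the Claude Opus 4 result as the reference
LAUNCH_BEST = 1487  # the mark a submission has to get below


def speedup(cycles, baseline=BASELINE):
    """How many times faster a result is than the baseline."""
    return baseline / cycles


def beats_launch_best(cycles):
    """True only for scores strictly below the 1487-cycle mark."""
    return cycles < LAUNCH_BEST


for cycles, label in RESULTS.items():
    print(f"{cycles} cycles  {speedup(cycles):.2f}x  {label}")
```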

The challenge is available on GitHub. Those who optimize below 1487 cycles — beating Claude's best performance at launch — are invited to email performance-recruiting@anthropic.com with code and a resume. Candidates can also apply through Anthropic's typical process, which uses their now Claude-resistant take-home. "We're curious how long it lasts."
