Anthropic has a problem that sounds like a tech company's version of a Zen koan: How do you test whether someone is smarter than the AI you built? The company's performance engineering team has iterated through three versions of its take-home test since 2024 because each new Claude model keeps rendering the previous version useless.
Tristan Hume, who leads the performance optimization team, published a detailed blog post this week explaining what happened. The short version: Claude Opus 4 beat most human candidates. Opus 4.5 matched and then surpassed the best of them. And now Anthropic has had to design something that barely resembles real engineering work at all, just so humans still have a chance.
The original test
The take-home involved optimizing code for a simulated accelerator with characteristics resembling TPUs. Candidates got four hours (later cut to two) to squeeze out as much performance as possible from a parallel tree traversal task. The simulator included all the things that make accelerator optimization interesting: manually managed memory, VLIW instruction packing, SIMD vector operations.
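For a flavor of what candidates were actually optimizing, consider a toy sketch (in throwaway numpy, under loose assumptions; this is not the actual simulator or its API): the level-at-a-time restructuring that SIMD hardware rewards shows up even in plain Python once a per-node loop is replaced with whole-array operations.

```python
# Toy illustration only -- not Anthropic's simulator. A perfect binary tree is
# stored level-by-level in a flat array, and each internal node is summed with
# its two children. The scalar version touches one node at a time; the
# vectorized version processes every internal node in a single array operation,
# the kind of restructuring the take-home rewards.
import numpy as np

def scalar_pass(tree):
    out = tree.copy()
    for i in range(len(tree) // 2):          # internal nodes of a perfect tree
        out[i] = tree[i] + tree[2 * i + 1] + tree[2 * i + 2]
    return out

def vectorized_pass(tree):
    out = tree.copy()
    idx = np.arange(len(tree) // 2)          # all internal nodes at once
    out[idx] = tree[idx] + tree[2 * idx + 1] + tree[2 * idx + 2]
    return out

tree = np.arange(2**10 - 1, dtype=np.int64)  # perfect binary tree, 10 levels
assert np.array_equal(scalar_pass(tree), vectorized_pass(tree))
```

The real exercise layers manual memory management and instruction packing on top of this, but the shape of the work is the same: find the data movement, restructure it, measure again.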
Hume's design goals were unusually thoughtful for a coding interview. He wanted something representative of actual work, fun enough that candidates would enjoy it, and open-ended enough that strong performers could keep going. The test worked well for about a year and helped hire most of the current team, including the engineers who brought up Anthropic's Trainium cluster.
Then Opus 4 showed up.
How Claude beat the test
The timeline here is brutal. Claude Opus 4 outperformed most human applicants within the time limit. Hume's fix was straightforward: use Claude to find where it started struggling and make that the new starting point. He wrote cleaner starter code, removed the parts Claude had already solved, and cut the time limit from four hours to two.
That worked for several months.
When Hume tested an early Opus 4.5 checkpoint, he watched Claude Code work on the problem for two hours. It solved the initial bottlenecks, implemented all the standard micro-optimizations, and hit the passing threshold in under an hour. Then it got stuck, convinced it had hit a memory bandwidth wall.
Most humans reach the same conclusion. But there are tricks that exploit the problem structure to work around that bottleneck. When Hume told Claude the cycle count that was actually achievable, it thought for a while and found the trick.
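Hume doesn't spell out the trick, and the sketch below is not it; it's just a generic member of that family, in throwaway numpy (the array names and sizes are made up): when the access pattern only touches a handful of distinct addresses, you can exploit that structure to load each unique value once instead of gathering from the big table on every request.

```python
# Generic illustration of structure-exploiting tricks, not the take-home's
# actual solution: if 100,000 requests only ever hit 16 distinct addresses,
# most of the memory traffic is redundant.
import numpy as np

values = np.arange(1_000_000, dtype=np.int64)      # large table in "slow" memory
requests = np.random.randint(0, 16, size=100_000)  # but only 16 distinct addresses

# Naive: one gather per request -- 100,000 reads from the big table.
naive = values[requests]

# Structure-aware: read the 16 unique addresses once, then index the tiny copy.
unique, inverse = np.unique(requests, return_inverse=True)
clever = values[unique][inverse]                   # 16 reads, then cheap local indexing

assert np.array_equal(naive, clever)
```

The point isn't this specific rewrite; it's that a perceived bandwidth wall can sometimes be sidestepped if the problem's structure makes most of the traffic unnecessary.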
The benchmark numbers tell the story (lower cycle counts are better). Claude Opus 4.5, after two hours in Anthropic's test-time-compute harness: 1,579 cycles. The best human performance in two hours: roughly 1,790 cycles. And with extended compute time, Claude got down to 1,487 cycles.
The new test is basically a puzzle game
Hume considered and rejected banning AI assistance. The enforcement problems are obvious, but he also had a sense that if humans still matter for this work, there should be some way for them to distinguish themselves alongside AI tools. He didn't want to concede that humans only have an advantage on tasks longer than a few hours.
His solution: Zachtronics-style puzzles. If you haven't played games like Shenzhen I/O, they use tiny, highly constrained instruction sets that force unconventional programming. State gets encoded into instruction pointers. Normal approaches don't work.
The new test consists of puzzles built on a tiny, heavily constrained instruction set, where candidates optimize for the lowest instruction count. No visualization or debugging tools are provided, and that's intentional: building your own debugger is part of what's being tested.
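For readers who haven't played those games, here's a hypothetical toy in that spirit (the instruction set and names are invented for illustration; this is not Anthropic's actual puzzle format), along with the kind of hand-rolled tracer a candidate would end up writing for themselves.

```python
# Hypothetical three-instruction machine in the Zachtronics spirit, plus a
# minimal tracer -- the sort of debugging tool the new test expects you to
# build yourself.
def run(program, acc=0, trace=False):
    """program: list of (op, arg) tuples. Ops: ADD, JNZ (relative jump if acc != 0), HALT."""
    pc = 0
    while 0 <= pc < len(program):
        op, arg = program[pc]
        if trace:
            print(f"pc={pc:2d} acc={acc:4d} {op} {arg}")
        if op == "ADD":
            acc += arg
            pc += 1
        elif op == "JNZ":
            pc += arg if acc != 0 else 1
        elif op == "HALT":
            break
    return acc

# Count down from 5 to 0 in three instructions.
prog = [("ADD", 5), ("ADD", -1), ("JNZ", -1)]
assert run(prog) == 0
```

Scoring by instruction count turns every puzzle into a search for encodings like this, where the loop structure lives in the jump offsets rather than in anything resembling ordinary code.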
"I'm sad to have given up the realism," Hume wrote. "But realism may be a luxury we no longer have."
The challenge is live
Anthropic open-sourced the original take-home on GitHub. The repo already has 1.8k stars and 306 forks. The invite is explicit: if you optimize below 1,487 cycles (Claude's best score at launch), email [email protected] with your code and a resume.
There's a catch, though. None of the first-day submissions claiming fewer than 1,300 cycles were valid: in each case, a language model had modified the tests to make the problem easier. The README now warns that if you use an AI agent, you should instruct it not to touch the tests folder.
The Hacker News thread had the predictable reactions. Some called it a knowledge test of GPU architecture (fair, kind of). Others questioned whether anyone good enough for Anthropic would bother with a take-home at all. One commenter noted the test resembles demoscene code golf.
What this actually means
The harder question isn't whether Claude can beat a hiring test. It's what happens next.
Anthropic CEO Dario Amodei said at Dreamforce that Claude already writes 90% of the code for most teams at the company. But that doesn't mean fewer engineers. The argument is leverage: engineers focus on the 10% that's hardest, or supervise groups of AI models.
The irony of an AI company struggling to evaluate candidates because their AI is too good isn't lost on anyone. Schools and universities have been dealing with AI-assisted cheating for years now. The difference is Anthropic actually has to compete with the tool that's causing the problem.
Hume's blog post ends with something like resignation. The new test works for now. He expects further iterations as Claude improves.