Anthropic Open-Sources Engineering Test That Claude 4.5 Solved Faster Than Humans

Anthropic releases its notoriously difficult performance take-home exam on GitHub. Claude Opus 4.5 beat every human candidate.

Oliver Senti, Senior AI Editor
January 21, 2026 · 4 min read
[Illustration: a clock face made of circuit patterns with code streaming past it, representing AI speed in technical challenges]

SEO METADATA

Meta Title: Anthropic Open-Sources Engineering Test That Claude 4.5 Solved Faster Than Humans
Meta Description: Anthropic releases its notoriously difficult performance take-home exam on GitHub. Claude Opus 4.5 beat every human candidate. Beat 1,487 cycles to get noticed.
URL Slug: anthropic-performance-takehome-opensource
Primary Keyword: Anthropic engineering test
Secondary Keywords: Claude Opus 4.5, performance optimization, AI coding test, Anthropic hiring
Tags: ["Anthropic", "Claude", "AI coding", "engineering interview", "open source", "performance optimization", "kernel optimization", "hiring", "benchmark"]


ARTICLE

Anthropic Releases the Take-Home Test Claude 4.5 Solved Better Than Any Human

The AI lab open-sources its performance engineering challenge after its own model made the test obsolete for hiring.

Anthropic has published its internal performance engineering take-home exam to GitHub, making public a test the company stopped using because Claude Opus 4.5 outperformed every human candidate who ever attempted it.

The challenge nobody asked for

The task sounds deceptively simple: optimize a kernel running on a simulated multi-core machine, measured in clock cycles. Candidates had two hours. The baseline implementation runs at 147,734 cycles. The best human performance in the allotted time: around 1,790 cycles.

Claude Opus 4.5, in what Anthropic describes as a "casual Claude Code session," matched that human benchmark. Given two hours in its test-time compute harness, the model hit 1,579 cycles. After 11.5 hours, it reached 1,487.

The repository includes the full simulated machine architecture, a reference kernel, and a trace viewer for debugging. Hacker News commenters were quick to note the resemblance to demoscene code golf, the niche community where programmers compete to produce the smallest or fastest code for aesthetic demos.

"It's designed to select for people who can be trusted to manually write PTX," one commenter observed, referencing NVIDIA's low-level GPU assembly language.

What the test actually requires

The code presents a deliberately confusing implementation that candidates must reverse-engineer before they can improve it. The simulated machine has multiple cores, vector operations, and a scratch memory system. Candidates work in Python, but the optimization work resembles GPU kernel tuning.

According to the task description, the goal is to minimize cycles by rewriting the KernelBuilder.build_kernel function. The test includes a frozen copy of the simulator to prevent gaming the measurement system.
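To give a feel for the shape of the problem, here is a toy sketch of a cycle-counting simulator and two kernels, one scalar and one vector-packed. Everything in it is invented for illustration: TinySim, its cycle costs, and both build_kernel_* functions are hypothetical stand-ins, not the repository's actual machine model, instruction set, or KernelBuilder interface.

    # Toy illustration only: a made-up cycle-counting simulator, not the API
    # of Anthropic's repository.

    class TinySim:
        """Charges a fixed cycle cost per operation: scalar ops cost 1 cycle,
        8-wide vector ops cost 2 cycles (about 4x the throughput when packed)."""

        SCALAR_COST = 1
        VECTOR_COST = 2
        VECTOR_WIDTH = 8

        def __init__(self):
            self.cycles = 0

        def scalar_add(self, a, b):
            self.cycles += self.SCALAR_COST
            return a + b

        def vector_add(self, xs, ys):
            assert len(xs) == len(ys) <= self.VECTOR_WIDTH
            self.cycles += self.VECTOR_COST
            return [x + y for x, y in zip(xs, ys)]


    def build_kernel_naive(sim, xs, ys):
        # One scalar add per element: N cycles for N elements.
        return [sim.scalar_add(x, y) for x, y in zip(xs, ys)]


    def build_kernel_vectorized(sim, xs, ys):
        # Pack elements into 8-wide vector ops: roughly N/4 cycles.
        out = []
        width = TinySim.VECTOR_WIDTH
        for i in range(0, len(xs), width):
            out.extend(sim.vector_add(xs[i:i + width], ys[i:i + width]))
        return out


    if __name__ == "__main__":
        data = list(range(1024))
        for kernel in (build_kernel_naive, build_kernel_vectorized):
            sim = TinySim()
            kernel(sim, data, data)
            print(f"{kernel.__name__}: {sim.cycles} cycles")

The real exercise layers multi-core scheduling, vector packing, and scratch-memory management on top of this basic idea, against a frozen simulator that counts every cycle.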

This isn't a LeetCode problem. There's no single algorithm to apply. The Hacker News discussion made that clear: one commenter noted that "packing vectors right" was proving difficult, while others debated whether the test favored rote knowledge of optimization patterns or genuine insight.

Anthropic's position appears to be: both matter, and the former is increasingly automatable.

Why release it now

The company's November 2025 blog post announcing Opus 4.5 highlighted the internal test results. Using parallel test-time compute (running multiple solution attempts and selecting the best), Opus 4.5 scored higher than any human in company history. Without that technique and without time limits, it tied the best-ever human.
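In spirit, the technique is straightforward: sample several independent attempts, score each with the simulator, and keep the cheapest. The sketch below shows only that selection step; attempt_solution and its fake scores are invented here, since Anthropic has not published the internals of its harness.

    # Sketch of the idea behind parallel test-time compute: run several
    # independent attempts and keep the lowest-cycle result. The attempt and
    # scoring functions are stand-ins, not Anthropic's actual harness.
    import random
    from concurrent.futures import ProcessPoolExecutor

    def attempt_solution(seed: int) -> tuple[int, str]:
        """Stand-in for one model attempt: returns (cycle_count, solution)."""
        rng = random.Random(seed)
        cycles = rng.randint(1400, 2200)  # pretend score from the simulator
        return cycles, f"solution-{seed}"

    def best_of_n(n: int = 8) -> tuple[int, str]:
        # Launch n attempts in parallel and select the one with the fewest cycles.
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(attempt_solution, range(n)))
        return min(results, key=lambda r: r[0])

    if __name__ == "__main__":
        cycles, solution = best_of_n(8)
        print(f"best attempt: {solution} at {cycles} cycles")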

Releasing the test now serves multiple purposes. It's a recruiting tool: Anthropic explicitly invites anyone who can beat 1,487 cycles to get in touch by email. It's also a statement about where AI capabilities are heading for technical work.

The caveats matter. Anthropic acknowledged the test doesn't measure collaboration, communication, or professional judgment. A two-hour kernel optimization exercise says nothing about whether someone can design systems over months or navigate organizational complexity. But it does measure something real, and that something is apparently within reach for current AI systems.

The scoreboard so far

The repository documents benchmark progression across Claude models:

Model               Cycles   Notes
Claude Opus 4       2,164    Many hours, test-time compute
Claude Opus 4.5     1,790    Casual session, matched best human
Claude Opus 4.5     1,579    2 hours, test-time compute
Claude Sonnet 4.5   1,548    Many hours, test-time compute
Claude Opus 4.5     1,487    11.5 hours, test-time compute
Claude Opus 4.5     1,363    Improved compute harness

Anyone can now attempt the challenge with unlimited time. The implicit question: can humans still compete when AI models get the same advantages?

The test runs locally in Python. Run python tests/submission_tests.py to see which thresholds your solution passes.

Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.
