AI Agents Can Use Skills But Cannot Write Their Own, SkillsBench Finds

First benchmark for agent skills shows curated guides boost performance 16pp, but self-generated ones make things worse.

Oliver Senti
Senior AI Editor
February 20, 2026 · 6 min read

A team of 40 researchers from Amazon, ByteDance, Carnegie Mellon, Stanford, UC Berkeley, and Oxford has built the first benchmark that actually measures whether those "Skills" files everyone keeps stuffing into their AI agents do anything useful. The answer, published in a preprint paper this month, is a qualified yes: human-written Skills raise agent pass rates by an average of 16.2 percentage points across 84 tasks. But here's the catch that should give the skill-sharing ecosystem pause: when models tried to write their own Skills before tackling a task, performance dropped.

What are Skills, and why should you care?

If you've used Claude Code, Gemini CLI, or OpenAI's Codex, you've probably encountered Skills already. They're structured bundles of instructions, code templates, and domain-specific knowledge that get loaded into an agent's context at inference time. Think of them as cheat sheets: a SKILL.md file that tells the agent exactly which scipy function to call for flood-risk analysis, or how to parse a specific file format. No fine-tuning required, no model modification. Just procedural knowledge injected at runtime.
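To make that concrete, here is what a minimal Skill might look like. The contents below are hypothetical, written for illustration in the general shape these harnesses load (a SKILL.md with a short frontmatter header and procedural instructions), not an actual file from the benchmark:

```markdown
---
name: flood-frequency-analysis
description: Estimate flood quantiles from annual peak streamflow records.
---

# Flood Frequency Analysis

Use the log-Pearson type III distribution (USGS methodology):

1. Log-transform the annual peak flows.
2. Compute the mean, standard deviation, and skew of the logs.
3. Fit with `scipy.stats.pearson3` and take the quantile for the
   target return period (e.g. 0.99 for the 100-year flood).
```

The point of the format is exactly what the article describes: nothing is trained into the model; the file is simply placed in context when the task looks relevant.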

Skill directories have been proliferating fast. The problem was nobody had a rigorous way to measure whether any of it actually helped. SkillsBench, led by Xiangyi Li of agent measurement startup BenchFlow, is an attempt to fix that.

The setup

The benchmark includes 86 tasks (84 evaluated, two excluded for infrastructure reasons) spanning 11 domains, from healthcare and cybersecurity to energy systems and seismology. Each task was written by humans, not generated by models, and verified with deterministic pytest-based checks. No LLM-as-a-judge here, which is refreshing. The researchers ran seven model configurations: Claude Code with Opus 4.5, Opus 4.6, Sonnet 4.5, and Haiku 4.5; Gemini CLI with Gemini 3 Pro and Gemini 3 Flash; and Codex CLI with GPT-5.2. Five trials per task per condition. That's 7,308 valid trajectories in total.
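The deterministic grading style is worth pausing on. A checker in this spirit (the task, file name, and reference value here are hypothetical, not taken from the benchmark) simply asserts on the artifact the agent was asked to produce:

```python
# Sketch of a deterministic, pytest-style verifier: the agent must
# write its answer to results.json, and the check compares it against
# a value precomputed by the task author -- no LLM-as-a-judge.
import json
from pathlib import Path

EXPECTED_Q100 = 4820.0  # reference answer (hypothetical)
TOLERANCE = 0.01        # allow 1% relative error

def check_flood_quantile(path="results.json"):
    result = json.loads(Path(path).read_text())
    q100 = result["q100_cfs"]
    return abs(q100 - EXPECTED_Q100) / EXPECTED_Q100 < TOLERANCE

# Simulate an agent's output and grade it:
Path("results.json").write_text(json.dumps({"q100_cfs": 4800.0}))
print(check_flood_quantile())
```

Because the check is a plain numeric comparison, every trial is reproducible: the same output file always gets the same verdict.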

Three conditions were tested: no Skills, curated (human-written) Skills, and self-generated Skills where the model was prompted to write its own procedural documentation before attempting the task.

Where Skills work, and where they don't

The headline number, +16.2 percentage points on average, obscures massive domain-level variation. In healthcare, the gain was +51.9pp. In software engineering? A meager +4.5pp. That pattern makes intuitive sense. Models are already drowning in code during pretraining. Giving them a Skills file about how to write Python is like handing a chef a recipe for toast. But clinical protocols, energy grid optimization, seismological analysis? That's procedural knowledge the model never saw during training, and Skills fill the gap.

The GitHub repo includes a nice example: a flood-risk analysis task where agents without Skills achieved a 2.9% pass rate because they didn't know to apply the Pearson type III probability distribution. With a curated Skill specifying the USGS methodology and the right scipy calls, pass rate jumped to 80%.

And 16 of those 84 tasks actually showed negative deltas with Skills. Sometimes the extra context just gets in the way.

Self-generated Skills are a bust

This is the result I keep coming back to. When models were told to analyze a task, write 1-5 modular Skill documents, and then solve the problem, they performed 1.3 percentage points worse on average than going in blind. Not the same. Worse.

The researchers designed this ablation to isolate whether the models' latent domain knowledge could substitute for human curation. It can't. As a Hacker News commenter pointed out, this isn't quite the same as having a model distill lessons from a completed task into a Skill for future use (which is how many developers actually use them). The paper tested cold generation: write the Skill before you've even attempted the problem. Still, the finding is striking. Models can consume procedural knowledge effectively but cannot produce it reliably.

The scale question

Curated Skills also act as a partial substitute for model size, which is the kind of finding that makes procurement decisions interesting. Claude Haiku 4.5 with Skills hit 27.7%, compared to 11.0% without. Claude Opus 4.5 without Skills managed only 22.0%. A smaller, cheaper model with the right procedural knowledge outperformed the flagship running bare.

Gemini 3 Flash posted the best absolute result among all configurations: 48.7% with Skills, up from 31.3% without. That was better than Gemini 3 Pro (41.2%), and competitive with Claude Opus 4.5 (45.3%) and GPT-5.2 (44.7%) despite being a much lighter model.

Less is more (seriously)

The paper reports that focused Skills with 2-3 modules outperform comprehensive documentation. The researchers don't elaborate on this as much as I'd like, but the implication tracks with what anyone who's worked with long contexts already suspects: past a certain point, more instructions create noise rather than signal. The agent spends tokens parsing your documentation instead of solving the problem.

This has practical consequences for the growing ecosystem of community-contributed Skills. More detailed is not better. The best Skill is apparently a tight SKILL.md with a couple of supporting files, not an encyclopedia.

What's missing

The paper acknowledges that Skills efficacy depends on how the agent harness implements them. Claude Code, Gemini CLI, and Codex CLI all handle Skills differently, and the paper doesn't fully untangle which performance differences come from the model versus the harness. That's a confound worth noting.

There's also the question of whether 84 tasks across 11 domains is enough to draw broad conclusions. The researchers recruited 105 contributors who submitted 322 candidate tasks, then curated down. The curation process seems rigorous, but some domains have only a handful of tasks. Drawing domain-level conclusions from small samples is always risky.

And the self-generated Skills test, while illuminating, only covers cold generation. The more common pattern in practice, where an agent learns from its mistakes and writes a Skill for next time, remains untested.

What comes next

BenchFlow is running the benchmark as an open-source project under an MIT license, with community contributions welcomed. The leaderboard is live on their site. As more models and agent harnesses hit the market, SkillsBench gives us at least one standardized way to ask: does this Skill actually help, or is it just context window filler?

For now, the takeaway is pragmatic. Write your Skills by hand. Keep them short. And don't trust the model to do it for you.

Tags: AI agents, LLM benchmarks, SkillsBench, Claude Code, Gemini CLI, agent skills, procedural knowledge, BenchFlow, AI evaluation
Oliver Senti

Senior AI Editor

Former software engineer turned tech writer, Oliver has spent the last five years tracking the AI landscape. He brings a practitioner's eye to the hype cycles and genuine innovations defining the field, helping readers separate signal from noise.

