Harness-1: 20B Search Agent Externalizes State With RL

A team led by researchers including Pengcheng Jiang and Jiawei Han published Harness-1 on June 1, 2026, a 20-billion-parameter search agent trained with reinforcement learning. The pitch is structural rather than scale-driven: instead of cramming the entire search history into a growing transcript, the model offloads its bookkeeping to an external harness and keeps only the decisions that actually need a brain.

That distinction is the whole paper, more or less.

The transcript problem

Most search agents run a loop you've seen before. Search, read, search, read, and everything piles into the context window. The model ends up doing five jobs at once. It decides what to look for, then has to remember what it already saw, judge which evidence matters, track which questions are still open, and recall which claims it actually verified versus the ones it just glanced at. The authors argue this is a bad division of labor. Reinforcement learning gets forced to optimize semantic judgment and routine record-keeping in the same gradient, and the record-keeping is exactly the part an environment can handle more reliably than a neural net.

So they split it. The policy still owns the judgment calls: what to search, which documents to keep or discard, what to verify, when to stop. The harness owns the state: a candidate pool of documents, an importance-tagged curated set, compact evidence links, verification records, and deduplicated, compressed observations. It also handles what they call budget-aware context rendering, which is a polite way of saying it decides what to actually show the model so the context window doesn't blow up.
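To make the split concrete, here is a minimal sketch of what harness-owned state could look like. It is my own illustration, not the authors' code: the `Harness` class, its method names, the integer importance tags, and the whitespace-token budget are all assumptions, and evidence links and observation compression are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Harness:
    """Hypothetical external state store; the policy only issues decisions."""

    token_budget: int = 200  # assumed budget, counted in whitespace-separated words
    pool: dict = field(default_factory=dict)     # doc_id -> text (candidate pool)
    curated: dict = field(default_factory=dict)  # doc_id -> importance tag
    verified: set = field(default_factory=set)   # doc_ids with a verification record

    def observe(self, doc_id: str, text: str) -> bool:
        """Add a search result to the pool; return False for a duplicate."""
        if doc_id in self.pool:
            return False
        self.pool[doc_id] = text
        return True

    def keep(self, doc_id: str, importance: int) -> None:
        """Policy decision: promote a pooled document into the curated set."""
        if doc_id not in self.pool:
            raise KeyError(f"unknown document: {doc_id}")
        self.curated[doc_id] = importance

    def discard(self, doc_id: str) -> None:
        """Policy decision: drop a document from the curated set."""
        self.curated.pop(doc_id, None)
        self.verified.discard(doc_id)

    def verify(self, doc_id: str) -> None:
        """Policy decision: record that a curated document was checked."""
        if doc_id not in self.curated:
            raise KeyError(f"not curated: {doc_id}")
        self.verified.add(doc_id)

    def render(self) -> str:
        """Budget-aware rendering: most important documents first,
        truncated once the word budget is spent."""
        lines, used = [], 0
        ranked = sorted(self.curated.items(), key=lambda kv: (-kv[1], kv[0]))
        for doc_id, importance in ranked:
            remaining = self.token_budget - used
            if remaining <= 0:
                break
            snippet = self.pool[doc_id].split()[:remaining]
            used += len(snippet)
            mark = "verified" if doc_id in self.verified else "unverified"
            lines.append(
                f"[{doc_id} | importance={importance} | {mark}] {' '.join(snippet)}"
            )
        return "\n".join(lines)
```

The point of the sketch is the shape of the interface. The policy calls `keep`, `discard`, and `verify`; it never has to remember what it has already seen, because `render` rebuilds a bounded view of the state on every turn.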

The framing is clean. Whether it generalizes past this particular task is the more interesting question, and I'll get to that.

What the numbers actually say

Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 hits 0.730 average curated recall and beats the next strongest open search subagent by 11.4 points. That's a real margin, not a rounding-error win, and it's measured against other open systems rather than against the model's own previous version, the kind of comparison I usually distrust most. Good.

Two caveats before anyone gets excited. The headline metric is curated recall, meaning how much of the gold evidence ends up in the final curated set. That's the right thing to measure for a retrieval subagent, but it's not end-to-end answer accuracy, and the two don't always move together. And "competitive with much larger frontier-model searchers" is doing some work in that abstract. Competitive how, against which models, on which subset? The paper draws the comparison; I'd want to read the per-benchmark tables before treating a 20B model as a frontier-class searcher.
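For readers who want the metric pinned down, here is a minimal sketch of curated recall as described above: the fraction of gold evidence documents that end up in the final curated set. The function names and the per-query macro-average are my assumptions; the paper's exact aggregation may differ.

```python
def curated_recall(curated_ids, gold_ids):
    """Fraction of gold evidence documents present in the final curated set."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold evidence set must be non-empty")
    return len(gold & set(curated_ids)) / len(gold)


def mean_curated_recall(runs):
    """Average per-query recall over (curated_ids, gold_ids) pairs.

    Assumption: a plain macro-average over queries."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(curated_recall(c, g) for c, g in runs) / len(runs)


# Two of four gold documents were curated; the extra document "d9" is ignored.
print(curated_recall(["d1", "d2", "d9"], ["d1", "d2", "d3", "d4"]))  # 0.5
```

Nothing in this computation looks at the answer the downstream model eventually gives, and curating irrelevant documents costs nothing. That is why recall and end-to-end accuracy can come apart.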

Still. Eight benchmarks across four genuinely different domains is a broader spread than a lot of these releases bother with.

Does it transfer?

Here's the part that caught my attention. The authors report that the gains are strongest on held-out transfer benchmarks the model never trained on. That's the opposite of the usual story, where performance sags the moment you leave the training distribution. If the search behavior really does generalize better than the in-domain behavior, that's the strongest evidence for their core claim: that pushing state into an explicit, structured harness teaches the policy retrieval strategies that aren't tied to any one dataset.

I want to believe it. I also want to see the variance bars, because transfer results that look better than in-domain results sometimes mean the in-domain benchmarks were just harder, or that the held-out set happened to suit the harness. The repo's own results section is refreshingly honest about this, warning that metrics shift with the query sample, the retrieval index, the reranker backend, the vLLM version, and the GPU kernels. When the authors tell you their own small evaluations have high variance, take them at their word.
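If you do run a small evaluation yourself, one cheap way to see how much the query sample alone moves the number is a bootstrap over per-query scores. The sketch below is a generic percentile bootstrap, not anything from the repo; the function name, resample count, and example scores are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-query scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample queries with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Hypothetical per-query curated recall from a tiny evaluation run.
per_query = [1.0, 0.5, 0.0, 1.0, 0.75, 0.25, 1.0, 0.5]
low, high = bootstrap_ci(per_query)
print(f"mean={statistics.fmean(per_query):.3f}, 95% interval=({low:.3f}, {high:.3f})")
```

This only captures sampling noise from the choice of queries. The other sources the repo lists (index, reranker backend, vLLM version, GPU kernels) need separate reruns to measure.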

The data efficiency claim

The source writeup that pointed me here said Harness-1 was trained on a strikingly small set, on the order of a few hundred supervised trajectories plus RL on a few thousand queries, with the argument being that most of the useful behavior lives in the harness rather than the weights. I couldn't confirm those exact figures from the abstract or the repo README, so I'm not going to print them as fact. If the numbers hold up in the full paper, it's the most consequential part of the whole thing: it would mean the harness, not the training budget, is doing the heavy lifting. That's a claim worth checking against the paper's training section rather than a secondhand summary.

What's released

The model checkpoint is up, the code is public, and the documented evaluation path runs on BrowseComp+, a benchmark for hard browsing tasks where a system has to find and curate supporting evidence. The catch, and the repo states it plainly, is that the large retrieval indexes for the web, SEC, and patent settings aren't bundled. You bring your own corpus and Chroma collection, plus OpenAI credentials for retrieval and an optional reranker. So "reproduce our results" comes with real infrastructure homework attached. The training code leans on Tinker-hosted paths and some private checkpoints too, which means a fully clean from-scratch reproduction isn't quite a weekend project.

The checkpoint is served through vLLM with GPT-OSS support and was validated on H100-class hardware, though the docs note other GPUs with enough memory may work.

If you want to read past the abstract, the Hugging Face paper page aggregates the discussion. The real test is the training and transfer sections of the full PDF: whether the data-efficiency claim survives contact with the appendix, and whether "competitive with frontier searchers" means what the abstract implies. I'm cautiously interested. The architecture argument is sound, the transfer result is genuinely surprising, and the honesty about evaluation variance earns some trust. Now someone outside the author list needs to run it on their own index and see if 0.730 holds.