Qwopus 9B Coder: What the Benchmarks Actually Say

A developer publishing as Jackrong has put out Qwopus3.5-9B-coder, a fine-tune of Qwen3.5-9B aimed at tool calling and agentic coding, on Hugging Face. It surfaced in the last few weeks and started making the rounds with a tidy framing: NousResearch dropped a 9B that scores in the low 50s on SWE-bench. Neither half of that is quite right.

Start with the attribution. The model card lists the base as Qwen3.5-9B, the author as Jackrong, and the license as Apache 2.0. NousResearch shows up only as the maker of the Hermes Agent harness the model is tuned to run inside, and as the format standard its training data follows. So this is a solo community fine-tune that learned to drive someone else's agent runner. Calling it a NousResearch model is like calling a custom ROM an official Google phone.

About that SWE-bench score

I went looking for the 53.33% figure. It isn't on the card. What's there instead are three small, locally-run test sets: HermesAgent-20, ToolCall-15, and BugFind-15. Twenty tasks, fifteen tasks, fifteen tasks. The numbers got tested on a Mac through LM Studio using a community suite, and the developer says every run used temperature 1 with two retries allowed before a task counts as failed.

On HermesAgent-20 the fine-tune scores 85 against 71 for stock Qwen3.5-9B. That's a real jump on the metric the author cares most about, and I'd believe a tuned model beats its base on the exact task distribution it was trained for. The harder question is what 85 on twenty hand-built tasks actually tells you about anything else. Probably not much.

The detail that didn't make the hype, buried in the second table: on the tool-call stability set, the fine-tune and the untouched base model both score a flat 100. Tied. If the headline pitch is "better at tool calling," the one benchmark explicitly about tool-call stability says the base was already maxing it out. The improvement is real but it's living almost entirely in the agentic-reasoning column, not in raw function-calling reliability.

How it was trained, which is the actually interesting part

The pitch the author makes for the size class is loud. The card calls the 9B dense format a "sweet spot" and claims this is "the best open-source model in its class," which, fine, every model card says that. The training method is where it gets weird in a good way.

Two ingredients. One is a published set of reasoning traces from Lambda: 7,646 multi-turn trajectories where an agent actually ran terminal commands, edited files, and drove a browser, with the real outputs kept rather than faked. Worth flagging a small inconsistency here. The dataset's own page says it was generated with Moonshot's Kimi-K2.5. The model card says the traces came from GLM-5.1 and Kimi. Someone's description is stale, and it isn't a detail you'd want wrong in a methods section.

The second ingredient is the one that made me sit up. The author describes a "Trace Inversion" pipeline: take the compressed, sanitized reasoning summaries that commercial APIs expose, then use a small surrogate model to reconstruct a plausible full chain-of-thought, then train on the reconstruction. The card names Claude and GLM and DeepSeek as targets. So you're not distilling a model's actual thinking. You're distilling a 4B model's guess at what the thinking might have been, dressed up inside <think> tags. Does that transfer real capability or just real-looking reasoning theater? I genuinely don't know, and the card doesn't run the ablation that would tell us.

The teaser nobody can find

The source post promised a Qwopus 3.6 27B coming soon. I couldn't find it referenced on the card at all. There's a stray mention of a "Qwen3.6 MoE base model" in the acknowledgements, on a model whose base is Qwen3.5, so even the lineage notes are a little tangled.

And the developer is upfront about the limits, to their credit. The card flags this as an experimental, research-only build and warns of "Capability Decay" outside coding and tool use. It's had roughly 19 downloads and three likes as of this writing. Which is the honest scale of the thing.

What to watch

If you want to know whether this is good, the move is obvious: run it yourself, hot, at temperature 1 like the author recommends, against tasks you actually care about. The GGUF quants are all there, from a 3.8GB Q2 up to the BF16. But treat the leaderboard numbers as what they are: a developer grading their own homework on a twenty-question test. The SWE-bench claim needs a SWE-bench run before anyone repeats it.

Qwopus 9B Coding Model Isn't a NousResearch Release, and the SWE-bench Number Is Missing

About that SWE-bench score

How it was trained, which is the actually interesting part

The teaser nobody can find

What to watch

Oliver Senti

Related Articles

EvalScope Adds Agent Mode, Turning Static Benchmarks Into Multi-Turn Tool-Use Tests

DeepSeek Sparse Attention Gets a From-Scratch Implementation Built for Reading

New Chronicles-OCR benchmark catches frontier vision models scoring near zero on ancient Chinese scripts

Stay Ahead of the AI Curve