EvalScope Agent Mode Turns Benchmarks Into Tool-Use Tests

ModelScope has wired an agent evaluation path into EvalScope, its open-source model evaluation framework, letting developers run familiar benchmarks like GSM8K, AIME, IFEval and SWE-bench not as one-shot question-and-answer prompts but as multi-turn agent runs. The model generates a step, calls a tool, reads the observation, and goes again until it either solves the task or burns through its turns. The repo sits at roughly 2,500 stars and ships under Apache-2.0.
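To make that loop concrete, here is a minimal sketch in plain Python. It is not EvalScope's implementation or API: `Step`, `Trace`, `run_agent`, the scripted policy and the `add` tool are all hypothetical names invented for illustration. It only shows the shape of the thing: ask the model for a step, run the tool, feed the observation back, and stop on a final answer or when the turn budget runs out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    tool: Optional[str]  # tool name to call, or None when giving a final answer
    argument: str        # tool input, or the final answer text


@dataclass
class Trace:
    events: List[dict] = field(default_factory=list)
    answer: Optional[str] = None
    finished: bool = False  # True only if the model produced a final answer


def run_agent(
    policy: Callable[[str, List[dict]], Step],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_turns: int = 5,
) -> Trace:
    """Hypothetical agent loop: step, tool call, observation, repeat."""
    trace = Trace()
    observation = task
    for turn in range(1, max_turns + 1):
        step = policy(observation, trace.events)
        if step.tool is None:
            # The model chose to answer; scoring the answer happens elsewhere.
            trace.answer = step.argument
            trace.finished = True
            trace.events.append({"turn": turn, "type": "final", "content": step.argument})
            return trace
        tool = tools.get(step.tool)
        if tool is None:
            observation = f"error: unknown tool {step.tool!r}"
        else:
            try:
                observation = tool(step.argument)
            except Exception as exc:  # errors go back to the model as observations
                observation = f"error: {exc}"
        trace.events.append({
            "turn": turn,
            "type": "tool_call",
            "tool": step.tool,
            "input": step.argument,
            "observation": observation,
        })
    return trace  # turn budget exhausted without a final answer


def scripted_policy(observation: str, history: List[dict]) -> Step:
    """Stand-in for a model: call the adder once, then report its result."""
    if not history:
        return Step(tool="add", argument="17,25")
    return Step(tool=None, argument=history[-1]["observation"])


def add_tool(text: str) -> str:
    return str(sum(int(part) for part in text.split(",")))


trace = run_agent(scripted_policy, {"add": add_tool}, "What is 17 + 25?")
print(trace.answer, trace.finished, len(trace.events))
```

Note that the loop returns a full list of events rather than just the answer, and that a run which exhausts `max_turns` is a distinct outcome from a run that answers wrongly. Both details matter later in this piece.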

That's the pitch, anyway, and it's a sensible direction. In production a model rarely just answers in text. It calls functions, writes code, runs a shell or Python, gets something wrong, reads the traceback, patches its own solution, and keeps going. A static accuracy score on a frozen test set tells you almost nothing about that.

What's actually new here, and what isn't

Worth being precise. A lot of the agentic plumbing has been in EvalScope for months. The SWE-bench integration landed in late 2025, covering the Verified, Lite and mini variants and scoring patches through unit-test verification. Function-calling evaluation runs through general_fc and the BFCL leaderboards. And tau-bench already simulates multi-turn dialogues between an agent and a model-played user, with domain API toolsets and policy guardrails.

So the headline framing, that you flip on agent mode in a TaskConfig and an existing benchmark becomes an agent run, is more about packaging than invention. Which is fine. Packaging is the whole value proposition of an eval harness. The promise that you don't rewrite the old benchmarks, you just toggle the config, is the part that matters to anyone who's maintained a pile of brittle eval scripts.

I'll flag one thing: I couldn't find a docs page sitting under a clean "Agent Evaluation Mode" heading with a single switchable strategy flag. The pieces are documented across separate benchmark pages. So treat the one-line-config claim as the announcement's framing until the unified docs show up.

The sandbox is the real story

Here's the part that's easy to skip past and shouldn't be. Agent benchmarks that let a model run shell and Python without isolation are, functionally, you executing untrusted generated code on your box. EvalScope routes this through Docker and ms-enclave, described as a modular agent sandbox runtime. Code benchmarks like HumanEval and LiveCodeBench gained sandboxed execution in the second half of 2025, with both local and remote run modes added shortly after.

This is the unglamorous infrastructure that determines whether an agent eval is something you can actually run on shared hardware or a liability you quietly avoid. Good that it's there by default.

Traces, not just a number in a table

The other genuinely useful bit: full trace capture. Every step, every tool call, every error, every observation, replayable in the Gradio-based dashboard (the evalscope app command, served on port 7861). For debugging an agent that fails on turn seven of a SWE-bench task, a single mean accuracy figure is useless. You need to see where the loop went sideways.

And this is where I'd push back on the typical agent-eval pitch. A multi-turn pass rate is a noisier signal than people pretend. The SWE-bench Verified leaderboard itself notes that results from mini-SWE-agent 1.x and 2.x aren't directly comparable, because 2.x switched to tool-calling to invoke actions while 1.x parsed actions out of raw output strings. Same model, same problems, different scaffold, different scores. So when you compare two models under "agent mode," you're partly comparing the agent loop, the prompt template, the max-turn budget, and the tool schema. Not just the model.
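A small sketch of why the scaffold matters. This is not mini-SWE-agent's code: the `ACTION:` text convention and both function names are assumptions made up for illustration, and the tool-call dictionary simply follows the common chat-completions layout (a `tool_calls` list whose `function.arguments` field is a JSON string). The point is that the same intended command can be recovered by one scaffold and dropped by the other.

```python
import json
import re
from typing import Optional

# Hypothetical text convention: the model must emit a line "ACTION: <command>".
ACTION_RE = re.compile(r"^ACTION:\s*(.+)$", re.MULTILINE)


def parse_text_action(raw: str) -> Optional[str]:
    """String-parsing scaffold: pull the command out of free-form output."""
    match = ACTION_RE.search(raw)
    return match.group(1).strip() if match else None


def parse_tool_call(message: dict) -> Optional[str]:
    """Tool-calling scaffold: read the command from a structured call."""
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    try:
        args = json.loads(calls[0]["function"]["arguments"])
    except (KeyError, TypeError, json.JSONDecodeError):
        return None
    if not isinstance(args, dict):
        return None
    return args.get("command")


# The model means the same thing in both cases, but formats the text sloppily.
sloppy_text = "I'll run the tests first.\nAction: pytest -q"
structured = {
    "tool_calls": [
        {"function": {"name": "bash", "arguments": '{"command": "pytest -q"}'}}
    ]
}

print(parse_text_action(sloppy_text))  # the case-sensitive regex misses "Action:"
print(parse_tool_call(structured))
```

In the text scaffold a capitalization slip costs the model a turn; in the structured one it can't happen. Neither result says anything about whether the model knew which command to run.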

The trace viewer at least lets you see that, instead of trusting the leaderboard cell.

So is it worth it?

If you're shipping anything that calls tools, probably yes. AI evaluation is shifting from "which model is smarter on a frozen test" toward the messier question of how a model behaves inside a loop, with tools, errors, constraints and real environment state. That second question is the one that breaks things in production.

Just don't read an agent-mode pass rate as gospel. Read the traces. Note your turn budget. Check what the sandbox actually allowed. The number on its own travels worse than it looks.
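One way to act on that, sketched under my own assumptions: keep the pass rate attached to the settings that produced it, and print an uncertainty band next to it. `RunRecord` and its fields are hypothetical, not an EvalScope schema, and the example values are made up. The Wilson score interval is a standard binomial interval; it only captures sampling noise from the number of tasks, not the scaffold effects described above.

```python
import math
from dataclasses import dataclass
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record: the score plus the context it depends on."""
    model: str
    benchmark: str
    scaffold: str    # which agent loop / prompt template was used
    max_turns: int   # the turn budget
    sandbox: str     # what the execution environment allowed
    passed: int
    total: int

    def summary(self) -> str:
        low, high = wilson_interval(self.passed, self.total)
        rate = self.passed / self.total
        return (
            f"{self.model} on {self.benchmark}: {rate:.1%} "
            f"(95% interval {low:.1%} to {high:.1%}, n={self.total}) "
            f"[scaffold={self.scaffold}, max_turns={self.max_turns}, "
            f"sandbox={self.sandbox}]"
        )


# Illustrative values only, not measured results.
record = RunRecord(
    model="example-model",
    benchmark="example-benchmark",
    scaffold="tool-calling",
    max_turns=20,
    sandbox="docker, no network",
    passed=30,
    total=50,
)
print(record.summary())
```

With 50 tasks, the interval around a 60% pass rate is more than twenty points wide, which is a useful thing to have on the page before arguing about a three-point gap between two models.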

EvalScope's latest tagged release is v1.5.0, from March 2026. The agent-mode docs and config examples live on the project's Read the Docs site, and the SWE-bench Pro request is still an open issue, so the benchmark coverage is clearly still moving.

Oliver Senti
