AA-Briefcase: New Business AI Benchmark From Artificial Analysis

Artificial Analysis released AA-Briefcase on June 18, a benchmark that drops AI models into messy, multi-week business projects and grades, across 91 tasks, whether they can actually finish the work. The scenarios were built with experts from Google, McKinsey and Boston Consulting Group, spanning data science, product management, banking operations and heavy industry strategy.

The headline result is less about who won than how little winning there was.

Nobody is good at this yet

Claude Fable 5 tops the leaderboard, followed by Claude Opus 4.8 and GLM-5.2. But the pass rates are humbling. Fable 5 leads on rubric checks and still satisfies every single criterion on only 3% of tasks. On 31 of the 91 tasks, no model scored above 50%. So the best business-analyst agent money can buy fully completes roughly one task in thirty.

That framing matters because the benchmark grades three different things: a binary rubric for whether the work is correct, plus pairwise comparisons on analytical quality and presentation. A model can produce something that looks like a polished board deck and still flunk the rubric because it missed a requirement buried in a Slack thread. The eval is built to catch exactly that gap.
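To see why a model can lead on rubric checks and still fully complete almost nothing, here is a minimal sketch of the two ways to count a binary rubric. The task names, the results and both function names are hypothetical, made up for illustration; this is not Artificial Analysis's grading code or data.

```python
from typing import Dict, List

# Hypothetical rubric results: task name -> one boolean per binary criterion.
# Illustrative values only, not data from AA-Briefcase.
results: Dict[str, List[bool]] = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, False, True],
    "task_c": [True, False, True, True],
}


def check_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Share of individual rubric criteria satisfied across all tasks."""
    checks = [c for criteria in results.values() for c in criteria]
    return sum(checks) / len(checks)


def full_completion_rate(results: Dict[str, List[bool]]) -> float:
    """Share of tasks on which every single criterion is satisfied."""
    completed = sum(all(criteria) for criteria in results.values())
    return completed / len(results)


print(f"criteria passed: {check_pass_rate(results):.0%}")
print(f"tasks fully completed: {full_completion_rate(results):.0%}")
```

In this toy data the model satisfies 10 of 12 criteria but fully completes only one task in three. One missed requirement is enough to sink a task under the stricter count, which is the gap the 3% figure reflects.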

The price problem

Cost per task swings by more than 800x. Fable 5 runs north of $31 per task on average. At the floor, DeepSeek V4 Flash sits around four cents. None of the cheapest models get anywhere near frontier quality, so the spread isn't really a bargain hunt.

The interesting middle is GLM-5.2, the strongest open-weight option, which lands about 90 Elo behind Opus 4.8 for less than a quarter of the cost. Whether 90 Elo of analytical rigor is worth paying more than four times as much depends entirely on what you're using it for, and Artificial Analysis is careful not to answer that for you.
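For a rough sense of what a 90 Elo gap means head to head, the sketch below applies the standard Elo expected-score formula. It assumes the conventional 400-point logistic scale, which the article does not confirm Artificial Analysis uses, and the function name is ours.

```python
def elo_expected_score(gap: float, scale: float = 400.0) -> float:
    """Expected win share for the higher-rated side, given its rating lead.

    Uses the standard Elo logistic curve. The 400-point scale is the usual
    convention and an assumption here, not a documented AA-Briefcase setting.
    """
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


# Rating gap quoted in the article: GLM-5.2 sits about 90 Elo behind Opus 4.8.
share = elo_expected_score(90)
print(f"expected win share for the higher-rated model: {share:.1%}")
```

Under that assumption, the higher-rated model would be preferred in a bit under two thirds of pairwise comparisons: a real edge, but not a rout.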

Why the leaders keep looking at their own work

One finding is hard to ignore. The models that score best on presentation are the ones that repeatedly render and visually inspect their outputs before submitting. Fable 5 averages around 21 image inspections per task and Opus 4.8 around 12. Gemini 3.1 Pro Preview averages roughly 0.1, frequently shipping files it never looked at.

Read that as a quiet indictment of every model that doesn't bother to look.

It tracks with how a competent human works. You don't email the slide without opening it first.

What it measures, and what it doesn't

Each task buries requirements across hundreds of source files: emails, Slack exports, transcripts, spreadsheets. Nearly 2,000 files in total. Pass rates fall as the number of required files climbs, and weaker models degrade faster. High-intelligence models drop from about 55% on prompt-only checks to roughly 40% once five or more files are involved.

The four scored scenarios stay private to avoid contamination. A fifth, a private-equity due diligence project, is public on Hugging Face as AA-Briefcase Lite, though it doesn't count toward the official scores. The benchmark runs on Stirrup, the firm's open-source agent framework, available on GitHub.

Artificial Analysis says it will update the leaderboard as new models ship. The next frontier release will be the real test of whether 3% was a ceiling or just a starting line.

Oliver Senti
