Sakana AI and KPMG Launch CoffeeBench to Test LLM Business Agents

GPT-5.5 topped the 90-day coffee supply chain sim. Claude Haiku 4.5 kept planning but stopped acting.

Andrés Martínez, AI Content Writer
June 27, 2026 (4 min read)

Sakana AI and KPMG AZSA LLC released CoffeeBench on June 26, a benchmark that drops a language model into a simulated coffee supply chain and watches whether it can run a business for 90 simulated days without going broke. The model under evaluation plays one of two roasters. Everyone else at the table, the farmers and retailers and the rival roaster, runs on Claude Sonnet 4.6.

The setup matters more than the coffee. Most agent benchmarks put one model against a passive environment that just sits there and reacts. Here the environment pushes back, because every other firm is also an autonomous agent chasing its own profit. The technical paper frames this as the gap nobody was testing: real economies are full of heterogeneous firms negotiating against each other, not a single agent poking a static world.

What the roaster has to do

Buy green beans. Roast them. Sell roasted product. Pay invoices on net-30 terms before late fees pile up. Watch inventory spoil at half a percent a day while fixed costs of thirty dollars drain the account every morning whether you trade or not. Go cash-negative and you are declared bankrupt and removed from the market, your unpaid bills written off as someone else's bad debt.
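The pressure those rules create is easy to see with a little arithmetic. The sketch below is not CoffeeBench code: it only uses the two figures quoted above ($30 a day in fixed costs, 0.5% daily spoilage) and assumes spoilage compounds multiplicatively and that bankruptcy triggers the first time cash drops below zero. The function names and the starting cash passed in are hypothetical.

```python
DAILY_FIXED_COST = 30.0  # dollars per day, figure quoted in the article
DAILY_SPOILAGE = 0.005   # 0.5% of inventory lost per day, figure quoted in the article
SIM_DAYS = 90            # length of one simulated run


def fixed_cost_total(days: int = SIM_DAYS) -> float:
    """Fixed costs paid over `days`, regardless of trading activity."""
    return DAILY_FIXED_COST * days


def inventory_remaining(start_units: float, days: int) -> float:
    """Units left after `days`, assuming spoilage compounds daily (an assumption)."""
    return start_units * (1.0 - DAILY_SPOILAGE) ** days


def days_until_bankrupt(starting_cash: float) -> int:
    """First day on which an idle firm's cash goes negative (hypothetical rule order)."""
    cash = starting_cash
    day = 0
    while cash >= 0:
        day += 1
        cash -= DAILY_FIXED_COST
    return day


print(fixed_cost_total())                         # fixed costs over a full run
print(round(inventory_remaining(100.0, 30), 1))   # units left from 100 after 30 days
print(days_until_bankrupt(100.0))                 # idle survival with $100 of cash
```

Under these assumptions, fixed costs alone add up to $2,700 over a full run, and beans held for a month lose roughly 14% of their volume, which is why sitting still is itself a losing strategy.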

Each run burns roughly $250 in API costs across all six firms and takes about eight hours, mostly waiting on API latency. The agent issues somewhere near a thousand tool calls per run. This is not a quick eval.
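To put those per-run figures in context, here is a back-of-the-envelope budget. It is a rough sketch, assuming every run costs about the same, that runs execute one after another, and that a sweep covers the six LLMs named in this article at three runs each; the paper's actual totals may differ. `sweep_budget` is a hypothetical helper.

```python
COST_PER_RUN_USD = 250     # approximate API cost per run, from the article
HOURS_PER_RUN = 8          # approximate wall-clock time per run, from the article
TOOL_CALLS_PER_RUN = 1000  # approximate tool calls by the evaluated agent
SIM_DAYS = 90


def sweep_budget(n_models: int, runs_per_model: int) -> tuple[int, int]:
    """Return (total dollars, total sequential hours) for a sweep of runs."""
    runs = n_models * runs_per_model
    return runs * COST_PER_RUN_USD, runs * HOURS_PER_RUN


# Assumed sweep: six LLMs, three runs each.
dollars, hours = sweep_budget(n_models=6, runs_per_model=3)

# Average pace implied by the per-run figures.
calls_per_sim_day = TOOL_CALLS_PER_RUN / SIM_DAYS
seconds_per_call = HOURS_PER_RUN * 3600 / TOOL_CALLS_PER_RUN

print(dollars, hours)
print(round(calls_per_sim_day, 1), round(seconds_per_call, 1))
```

On those assumptions a full sweep lands around $4,500 and 144 hours of sequential wall-clock time, with the agent averaging about 11 tool calls per simulated day.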

The standings

GPT-5.5 finished first with mean net income of +$3,109 over three runs, ahead of Claude Opus 4.7 at +$2,782. Both leaned hard on actually executing trades and talked constantly to counterparties. GPT-5.5 sent about 140 messages a run. That chattiness tracks with the paper's main behavioral finding: the models that communicate more tend to make more money.

Then it gets messier. GLM-5.1 posted the highest revenue of anyone, $16,962, and still landed mid-pack on net income at +$1,597. High sales, thin margins, which is its own kind of lesson. Gemini 3.1 Pro barely spoke, sending just 16 outbound messages, yet cleared +$1,695 by reading inbound messages and reacting rather than initiating. The authors call it efficient. You could also call it lucky, given the variance.
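The GLM-5.1 numbers make the margin point concrete. This is a quick calculation on the two figures quoted above, assuming both the revenue and the net income are averages over the same runs, which the article does not spell out. `net_margin` is a hypothetical helper.

```python
def net_margin(net_income: float, revenue: float) -> float:
    """Net income as a fraction of revenue."""
    return net_income / revenue


# GLM-5.1 figures as quoted in the article.
glm_margin = net_margin(net_income=1597, revenue=16962)
print(f"{glm_margin:.1%}")
```

That works out to a little over nine cents of profit per dollar of sales.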

And the variance is real. Run-to-run spread is wide enough that the authors ran each setting three times and still flag statistical reliability as a limitation. The standard deviations on net income often run into the thousands, which is worth keeping in mind before treating any single ranking as gospel.
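A rough illustration of why three runs is thin: the standard error of a mean is the standard deviation divided by the square root of the number of runs. The sketch below plugs in an assumed standard deviation of $1,000, chosen only as the low end of "into the thousands" and not taken from the paper, and compares it with the gap between the top two models reported above.

```python
import math


def standard_error(std_dev: float, n_runs: int) -> float:
    """Standard error of the mean for n_runs independent runs."""
    return std_dev / math.sqrt(n_runs)


ASSUMED_STD = 1000.0    # placeholder value, NOT a figure from the paper
N_RUNS = 3              # runs per setting, from the article
TOP_TWO_GAP = 3109 - 2782  # GPT-5.5 minus Claude Opus 4.7, from the article

se = standard_error(ASSUMED_STD, N_RUNS)
print(round(se, 2), TOP_TWO_GAP)
```

With that placeholder, the standard error on each mean is larger than the $327 separating first and second place, so the ordering at the top could plausibly flip on a rerun.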

The interesting failure

Claude Haiku 4.5 was the only model to post a loss, mean net income of -$630. The reason is the part of the paper worth reading. Haiku did not crash or produce garbage. It wrote coherent assessments, laid out plans, and then repeatedly chose to wait rather than act, sitting idle on roughly 40 of the 90 days. The authors named this idle-drift: reasoning stays intact, action evaporates.

The agent "maintains coherent reasoning traces yet repeatedly chooses to wait rather than act," per the paper, which is a more unsettling failure than a model that simply gets confused. A model that knows what to do and does nothing is harder to debug than one that's just wrong.

Kimi K2.6 squeaked out a small positive at +$454, executing nearly as many tool calls as GPT-5.5 but barely negotiating, which the authors read as proof that trade volume without proactive pricing does not pay. Both rule-based baselines, a do-nothing PassiveRoaster and a fixed-policy HeuristicRoaster, lost money outright.

Where to look

The code is public, and Sakana published the full agent trajectories so you can read the actual reasoning traces and inter-agent messages behind the numbers. The benchmark was accepted to the FAGEN workshop at ICML 2026 (July 6 to 11 in Seoul), not to the main conference track. Worth noting before anyone inflates the credential.
