
Qwen Releases DeepPlanning, a Benchmark That Breaks Frontier AI Models

Multi-day travel and shopping tasks reveal fundamental weaknesses in how LLMs handle real-world constraints.

Liza Chan, AI & Emerging Tech Correspondent
January 27, 2026
[Illustration: an AI agent struggling to complete a complex travel-planning network, with constraint violations highlighted]

Alibaba's Qwen team released DeepPlanning, a benchmark designed to test whether AI agents can actually plan across extended timeframes with hard constraints. The results, published January 26, show even the best models failing more often than succeeding.

The gap nobody was measuring

Most AI benchmarks test step-by-step reasoning. DeepPlanning does something different: it requires agents to satisfy global constraints across an entire plan. Miss your budget by a dollar on day three of a five-day trip? The whole thing fails.
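To make the all-or-nothing scoring concrete, here is a minimal sketch of the idea; the plan structure, field names, and numbers are illustrative assumptions, not DeepPlanning's actual schema.

    # Hypothetical plan format: a list of days, each a list of activities.
    # All-or-nothing check: the whole plan fails if the budget is exceeded
    # by any amount, no matter how good the rest of the itinerary is.
    def plan_valid(days, budget):
        total = sum(activity["cost"] for day in days for activity in day)
        return total <= budget

    plan = [[{"name": "flight", "cost": 220.0}],
            [{"name": "museum", "cost": 35.0}],
            [{"name": "dinner", "cost": 46.0}]]
    print(plan_valid(plan, budget=300.0))  # False: $301 total, plan rejected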

The benchmark includes 240 travel planning tasks and 120 shopping optimization problems. Travel tasks demand minute-level itineraries where flight times, attraction hours, and transit durations must align without overlaps or budget violations. Shopping tasks require navigating coupon stacking rules and cross-store discounts to hit the lowest possible price.

What makes this hard isn't the individual steps. It's that a locally optimal decision can torpedo the entire plan. Book the cheapest flight and you might miss a connection that breaks your whole itinerary.
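A toy version of that trap, with invented flight data, looks like this:

    # The cheapest flight lands after a noon tour that is already on the
    # itinerary, so the locally optimal choice breaks the global plan.
    flights = [
        {"id": "F1", "price": 120, "arrives": 14 * 60},  # 2:00 pm, cheapest
        {"id": "F2", "price": 180, "arrives": 11 * 60},  # 11:00 am
    ]
    tour_starts = 12 * 60  # minutes since midnight

    greedy = min(flights, key=lambda f: f["price"])
    feasible = [f for f in flights if f["arrives"] <= tour_starts]
    best = min(feasible, key=lambda f: f["price"]) if feasible else None

    print(greedy["id"])  # F1: cheapest, but arrives after the tour starts
    print(best["id"])    # F2: the plan-level optimum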

GPT-5.2 leads, but "leading" means 44%

The leaderboard puts GPT-5.2 at the top with 44.6% average accuracy across both domains. Claude 4.5 Opus with thinking enabled sits at 33.9%. Gemini 3 Flash Preview comes in at 28.8%.

These numbers demand context. On travel planning specifically, GPT-5.2 achieves an 85.8% composite score combining commonsense and personalization metrics, but only 35% of its complete itineraries pass all validation checks. The gap between "mostly right" and "actually works" is wider than the raw scores suggest.

Shopping planning shows a similar pattern. Models score well on matching individual product requirements but collapse when optimizing across the full cart. Qwen-Plus without thinking mode hits 0% case accuracy on travel despite a 28.9% composite score.

Reasoning mode helps, sort of

The benchmark tested models with and without extended thinking. Claude 4.5 Opus jumps from 26.3% to 33.9% with thinking enabled. Qwen3-Max goes from 12.8% to 28.7%. DeepSeek-V3.2 improves from 5.3% to 21.6%.

But thinking mode isn't a silver bullet. o3, OpenAI's reasoning-focused model, scores just 24.9% overall despite strong individual metrics. It seems capable of rigorous local reasoning while still producing globally invalid plans.

The Qwen team notes that reliable parallel tool use correlates with better scores. Models that can gather information from multiple APIs simultaneously make fewer errors than those that query them one at a time.
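The benchmark report doesn't expose the agents' internals, but the mechanism is easy to picture: concurrent queries return the same information in fewer turns, leaving less room for state-tracking errors. A minimal sketch with stubbed-out async APIs (the tool names are invented; DeepPlanning's real toolkit differs):

    import asyncio

    # Stand-in for a network-bound tool call.
    async def query_api(name: str) -> dict:
        await asyncio.sleep(1)
        return {"source": name}

    async def gather_parallel():
        # All three queries run concurrently: roughly one round-trip
        # instead of three sequential ones.
        return await asyncio.gather(
            query_api("flights"),
            query_api("attractions"),
            query_api("transit"),
        )

    results = asyncio.run(gather_parallel())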

What the failures reveal

Travel planning exposes how models handle temporal reasoning. The benchmark checks 21 specific requirements across route consistency, time feasibility, business hours, and cost accuracy. Models routinely schedule activities during closed hours or book transit that arrives after departure.
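A simplified version of two of those checks, using an assumed event format rather than the benchmark's real one:

    # Each event: (start, end, opens, closes), all in minutes since
    # midnight, sorted by start time. Two of the 21 checks, approximated:
    def time_feasible(events):
        prev_end = 0
        for start, end, opens, closes in events:
            if start < opens or end > closes:
                return False  # scheduled during closed hours
            if start < prev_end:
                return False  # overlaps the previous event, e.g. transit
                              # that arrives after the next departure
            prev_end = end
        return True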

Shopping tasks surface combinatorial optimization failures. Coupon rules in DeepPlanning interact in ways that require considering the entire cart simultaneously. Most models apply coupons greedily rather than optimizing globally.
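Here is an invented example of why greedy application loses; the coupon rules and prices are made up, but the failure mode mirrors the one described above:

    from itertools import chain, combinations

    cart = 100.0
    # (name, discount, conflicts): A is the biggest single coupon but
    # conflicts with B and C, which stack with each other.
    coupons = [("A", 0.40, {"B", "C"}),
               ("B", 0.25, set()),
               ("C", 0.25, set())]

    def total(chosen):
        names = {name for name, _, _ in chosen}
        if any(conflicts & names for _, _, conflicts in chosen):
            return float("inf")  # conflicting coupons: invalid combination
        price = cart
        for _, discount, _ in chosen:
            price *= 1 - discount
        return price

    subsets = chain.from_iterable(
        combinations(coupons, r) for r in range(len(coupons) + 1))
    print(min(total(s) for s in subsets))  # 56.25: B + C stacked
    # Greedy (grab the 40% coupon first) gets stuck at $60.00 with A alone.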

The evaluation uses code-based automated checking rather than LLM-based scoring, eliminating the possibility that models are gaming soft metrics. Either the plan validates against all constraints or it doesn't.

What's actually being tested

DeepPlanning isolates three capabilities that separate genuine planning from sophisticated pattern matching. First, proactive information gathering: agents must actively query APIs to discover hidden states rather than hallucinating facts. Second, local constrained reasoning: each step must satisfy specific requirements like matching hotel amenities or product attributes. Third, global constrained optimization: the plan as a whole must stay within budget and time boundaries.

The dataset includes Chinese and English travel tasks plus English shopping tasks, with toolkits of 9 and 15 specialized APIs respectively. Each task runs in an isolated Python sandbox with up to 400 tool calls allowed.

Reading the tea leaves

These results complicate the narrative around AI agent capabilities. Models that excel at coding benchmarks and mathematical reasoning still struggle with the kind of planning humans do when booking a vacation. The bottleneck isn't intelligence in any abstract sense. It's maintaining coherent constraints across time.

Whether this gap closes quickly or slowly will shape how fast AI agents move from demos to deployment. The benchmark provides a concrete target. The code and data are open source.

Liza Chan
AI & Emerging Tech Correspondent

Liza covers the rapidly evolving world of artificial intelligence, from breakthroughs in research labs to real-world applications reshaping industries. With a background in computer science and journalism, she translates complex technical developments into accessible insights for curious readers.


