
Six-Person Startup Poetiq Outscores Google Gemini 3 on Reasoning Benchmark

Poetiq hits 54% on ARC-AGI-2, beating Gemini 3 Deep Think's 45% without training a new model.

Andrés Martínez
AI Content Writer
December 9, 2025

A startup with six employees just topped one of AI's hardest tests. Poetiq claimed first place on ARC-AGI-2, a semi-private benchmark designed to measure genuine reasoning ability rather than pattern matching. The team's system solved 54% of the tasks.

Google's Gemini 3 Deep Think previously reported 45% on the same test. That's a gap of nine percentage points between a tech giant and a team you could fit in a single minivan.

ARC-AGI was created by François Chollet, the Keras creator and longtime critic of current AI evaluation methods. The benchmark throws novel visual puzzles at systems, testing whether they can abstract principles and apply them to problems they've never seen. Most models struggle because the test deliberately avoids anything that can be memorized or pattern-matched from training data.

Poetiq's approach skipped the usual playbook. Rather than training a new foundation model, the team built an orchestration layer that coordinates existing models. The company hasn't disclosed which models power the system or the specifics of its architecture.
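Because Poetiq has not published its architecture, the following is only a generic illustration of what "coordinating existing models" can mean, not a description of the company's system. It is a minimal sketch under loud assumptions: the two "models" are hypothetical stand-ins written as plain Python functions that propose candidate grid transformations, and the orchestrator keeps the first candidate that reproduces every training pair of a toy ARC-style task.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Transform = Callable[[Grid], Grid]
Pair = Tuple[Grid, Grid]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(grid)]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


# Hypothetical stand-ins for calls to existing models (assumption):
# each one looks at the training pairs and proposes candidate rules.
def model_a(train_pairs: List[Pair]) -> List[Transform]:
    return [flip_vertical, transpose]


def model_b(train_pairs: List[Pair]) -> List[Transform]:
    return [flip_horizontal]


def solve(train_pairs: List[Pair], test_input: Grid,
          models: List[Callable[[List[Pair]], List[Transform]]]) -> Optional[Grid]:
    """Ask each model for candidates; return the output of the first
    candidate that is consistent with every training pair."""
    for model in models:
        for candidate in model(train_pairs):
            if all(candidate(x) == y for x, y in train_pairs):
                return candidate(test_input)
    return None  # no proposed rule explains the examples


# Toy task: the hidden rule is "mirror each row".
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
]
test_input = [[7, 8], [9, 0]]

answer = solve(train_pairs, test_input, [model_a, model_b])
print(answer)
```

The point of the sketch is the division of labor: the proposers can be wrong most of the time as long as a cheap check against the training examples filters their candidates. Whether Poetiq's system works this way is not known from the information disclosed.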

The result lands at an awkward time for large labs investing billions in next-generation models. If a half-dozen engineers can beat flagship systems through clever coordination, the brute-force scaling strategy looks less inevitable.

Poetiq's score remains well below the 85% threshold Chollet set for the $1 million ARC Prize. But for now, they're leading.

The Bottom Line: Poetiq's 54% ARC-AGI-2 score suggests model orchestration can outperform single large models on abstract reasoning tasks.


QUICK FACTS

  • Poetiq team size: 6 people
  • Poetiq ARC-AGI-2 score: 54%
  • Google Gemini 3 Deep Think score: 45%
  • ARC Prize threshold: 85% for $1M prize
  • Benchmark creator: François Chollet