A startup with six employees just topped one of AI's hardest tests. Poetiq claimed first place on ARC-AGI-2, a semi-private benchmark designed to measure genuine reasoning ability rather than pattern matching. The team solved 54% of the tasks.
Google's Gemini 3 Deep Think previously reported 45% on the same test. That's a nine-point gap between a tech giant and a team you could fit in a single minivan.
ARC-AGI was created by François Chollet, the Keras creator and longtime critic of current AI evaluation methods. The benchmark throws novel visual puzzles at systems, testing whether they can abstract principles and apply them to problems they've never seen. Most models struggle because the test deliberately avoids anything that can be memorized or pattern-matched from training data.
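To make the benchmark's setup concrete, here is a toy illustration of an ARC-style task, not an actual ARC-AGI-2 puzzle (those are withheld): tasks present a few input/output grid pairs, and a solver must infer the hidden transformation and apply it to an unseen test grid. The rule here (diagonal reflection) is invented for illustration.

```python
# Hypothetical toy ARC-style task, not a real ARC-AGI-2 puzzle.
# Grids are small matrices of color codes; the solver sees a few
# demonstration pairs and must infer the hidden transformation.

def transpose(grid):
    """The hidden rule for this toy task: reflect the grid along its diagonal."""
    return [list(row) for row in zip(*grid)]

# Demonstration pairs the solver would see.
train_pairs = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[0, 5], [5, 0]], [[0, 5], [5, 0]]),
]

# A candidate rule must reproduce every demonstration pair...
assert all(transpose(inp) == out for inp, out in train_pairs)

# ...and then generalize to an unseen test grid.
test_input = [[7, 0, 0], [0, 7, 0]]
print(transpose(test_input))  # [[7, 0], [0, 7], [0, 0]]
```

Because each task's rule is novel, memorizing training data doesn't help; the solver has to abstract the rule from two or three examples.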
Poetiq's approach skipped the usual playbook. Rather than training a new foundation model, the team built an orchestration layer that coordinates existing models. The company hasn't disclosed which models power the system or the specifics of its architecture.
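Since Poetiq hasn't published its design, here is only a generic sketch of what model orchestration can look like, under assumptions that do not reflect their actual system: the `solvers` are stand-in callables for underlying model APIs, and `verify` is an assumed task-specific checker (ARC candidates can be checked against the demonstration pairs).

```python
# Hypothetical orchestration sketch; Poetiq's real architecture is undisclosed.
# Query several existing models, keep only candidates that pass a
# verification check, and return the most common verified answer.

from collections import Counter

def orchestrate(task, solvers, verify, max_rounds=3):
    """Coordinate multiple solver callables on one task via verify-then-vote."""
    verified = []
    for _ in range(max_rounds):
        for solve in solvers:
            candidate = solve(task)
            if verify(task, candidate):
                verified.append(candidate)
        if verified:
            break  # stop early once any round yields a verified answer
    if not verified:
        return None
    # Majority vote over verified candidates (keys made hashable via repr).
    counts = Counter(repr(c) for c in verified)
    best = counts.most_common(1)[0][0]
    return next(c for c in verified if repr(c) == best)
```

The appeal of this pattern is that it spends extra inference-time compute on coordination and checking rather than on training a new foundation model, which is consistent with the little Poetiq has said, though the specifics here are invented.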
The result lands at an awkward time for large labs investing billions in next-generation models. If a half-dozen engineers can beat flagship systems through clever coordination, the brute-force scaling strategy looks less inevitable.
Poetiq's score remains well below the 85% threshold Chollet set for the $1 million ARC Prize. But for now, they're leading.
The Bottom Line: Poetiq's 54% ARC-AGI-2 score suggests model orchestration can outperform single large models on abstract reasoning tasks.
QUICK FACTS
- Poetiq team size: 6 people
- Poetiq ARC-AGI-2 score: 54%
- Google Gemini 3 Deep Think score: 45%
- ARC Prize threshold: 85% for $1M prize
- Benchmark creator: François Chollet
