A startup with six employees just topped one of AI's hardest tests. Poetiq claimed first place on ARC-AGI-2, a semi-private benchmark designed to measure genuine reasoning ability rather than pattern matching. The team solved 54% of the tasks.
Google's Gemini 3 Deep Think previously reported 45% on the same test. That's a nine-point gap between a tech giant and a team you could fit in a single minivan.
ARC-AGI was created by François Chollet, the Keras creator and longtime critic of current AI evaluation methods. The benchmark throws novel visual puzzles at systems, testing whether they can abstract principles and apply them to problems they've never seen. Most models struggle because the test deliberately avoids anything that can be memorized or pattern-matched from training data.
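To make the benchmark's setup concrete, here is a toy illustration of an ARC-style task, not an actual ARC-AGI-2 puzzle (those are withheld): tasks present a few input/output grid pairs, and a solver must infer the hidden transformation and apply it to an unseen test grid. The rule here (diagonal reflection) is invented for illustration.

```python
# Hypothetical toy ARC-style task, not a real ARC-AGI-2 puzzle.
# Grids are small matrices of color codes; the solver sees a few
# demonstration pairs and must infer the hidden transformation.

def transpose(grid):
    """The hidden rule for this toy task: reflect the grid along its diagonal."""
    return [list(row) for row in zip(*grid)]

# Demonstration pairs the solver would see.
train_pairs = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[0, 5], [5, 0]], [[0, 5], [5, 0]]),
]

# A candidate rule must reproduce every demonstration pair...
assert all(transpose(inp) == out for inp, out in train_pairs)

# ...and then generalize to an unseen test grid.
test_input = [[7, 0, 0], [0, 7, 0]]
print(transpose(test_input))  # [[7, 0], [0, 7], [0, 0]]
```

Because each task's rule is novel, memorizing training data doesn't help; the solver has to abstract the rule from two or three examples.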
Poetiq's approach skipped the usual playbook. Rather than training a new foundation model, the team built an orchestration layer that coordinates existing models. The company hasn't disclosed which models power the system or the specifics of its architecture.
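Since Poetiq hasn't published its design, here is only a generic sketch of what model orchestration can look like, under assumptions that do not reflect their actual system: the `solvers` are stand-in callables for underlying model APIs, and `verify` is an assumed task-specific checker (ARC candidates can be checked against the demonstration pairs).

```python
# Hypothetical orchestration sketch; Poetiq's real architecture is undisclosed.
# Query several existing models, keep only candidates that pass a
# verification check, and return the most common verified answer.

from collections import Counter

def orchestrate(task, solvers, verify, max_rounds=3):
    """Coordinate multiple solver callables on one task via verify-then-vote."""
    verified = []
    for _ in range(max_rounds):
        for solve in solvers:
            candidate = solve(task)
            if verify(task, candidate):
                verified.append(candidate)
        if verified:
            break  # stop early once any round yields a verified answer
    if not verified:
        return None
    # Majority vote over verified candidates (keys made hashable via repr).
    counts = Counter(repr(c) for c in verified)
    best = counts.most_common(1)[0][0]
    return next(c for c in verified if repr(c) == best)
```

The appeal of this pattern is that it spends extra inference-time compute on coordination and checking rather than on training a new foundation model, which is consistent with the little Poetiq has said, though the specifics here are invented.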
The result lands at an awkward time for large labs investing billions in next-generation models. If a half-dozen engineers can beat flagship systems through clever coordination, the brute-force scaling strategy looks less inevitable.
Poetiq's score remains well below the 85% threshold Chollet set for the $1 million ARC Prize. But for now, they're leading.
The Bottom Line: Poetiq's 54% ARC-AGI-2 score suggests model orchestration can outperform single large models on abstract reasoning tasks.
QUICK FACTS
- Poetiq team size: 6 people
- Poetiq ARC-AGI-2 score: 54%
- Google Gemini 3 Deep Think score: 45%
- ARC Prize threshold: 85% for $1M prize
- Benchmark creator: François Chollet
