In September 2025, three OpenAI researchers and a Georgia Tech professor published a paper that should have been more uncomfortable for their employer than it was. The paper, "Why Language Models Hallucinate," lays out a mathematical framework proving that large language models will always generate confident falsehoods, no matter how clean the training data or how much compute you throw at the problem. Not sometimes. Not until the next update. As a structural property of how these systems work.
The authors, Adam Tauman Kalai, Ofir Nachum, Edwin Zhang, and Santosh S. Vempala, didn't hedge much. They showed that a model's generative error rate is bounded below by twice its misclassification rate on a simpler binary task: deciding whether a given statement is valid. If distinguishing true from false is hard (and it often is, especially for rare facts), then generating correct text is harder still. Always.
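The shape of that bound, in numbers. This is a paraphrase of the core inequality only; the paper's full statement carries additional correction terms that are dropped here:

```python
def generative_error_floor(iiv_error_rate: float) -> float:
    """Kalai et al.'s core inequality, simplified: the generative error
    rate is at least twice the error rate on the 'is-it-valid' (IIV)
    binary classification task. Extra terms in the paper are omitted."""
    return 2.0 * iiv_error_rate

# If telling valid from invalid statements is wrong 20% of the time,
# generation is wrong at least 40% of the time.
print(generative_error_floor(0.20))
```

The direction of the inequality is what matters: classification difficulty is the floor, not the ceiling, for generation errors.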
The exam analogy holds up uncomfortably well
The paper's central metaphor is a student facing an exam. When you don't know the answer, you guess, because leaving it blank gets you the same score as getting it wrong: zero. Language models face exactly this incentive structure, and the OpenAI team went looking into whether the industry's own testing regime reinforces it.
They examined ten major AI benchmarks, the ones that companies and leaderboards use to rank models against each other. Nine of them use binary grading. Say "I don't know" and you get zero points, same as a wrong answer. The mathematically optimal strategy under those rules is obvious: always guess. The paper frames this as an "epidemic" of penalizing honesty, which is a strong word coming from researchers at the company that runs ChatGPT.
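The incentive math is simple enough to spell out. A sketch, where only the binary 1/0 grading scheme is taken from the paper; the scoring function itself is a simplification for illustration:

```python
def expected_score(p_correct: float, abstain: bool, wrong_penalty: float = 0.0) -> float:
    """Expected score for one question under a grading scheme.

    Binary grading (the scheme nine of the ten benchmarks use) sets
    wrong_penalty = 0: a wrong answer and "I don't know" both score zero.
    """
    if abstain:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

# Under binary grading, guessing strictly dominates abstaining
# whenever there is any chance at all of being right.
p = 0.01  # model is almost certainly wrong
print(expected_score(p, abstain=False))  # still beats the 0.0 for abstaining
```

With a zero penalty for being wrong, a model that guesses on everything can only gain points relative to one that abstains. That is the whole incentive problem in one line of arithmetic.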
Think about what this means for the training pipeline. Models learn to be good test-takers. And the tests reward confident guessing over calibrated uncertainty. So that's what we get.
The numbers from OpenAI's own models are worse than the theory
Here's where it gets awkward. Months before this paper dropped, OpenAI had already published hallucination rates for its newer reasoning models in the o3 and o4-mini system card. On PersonQA, their in-house benchmark for factual accuracy about people, o1 hallucinated 16% of the time. The newer o3? 33%. And o4-mini hit 48%.
Nearly half. Their newest, smallest reasoning model was wrong about people almost half the time it was asked.
OpenAI's explanation in the system card is that o3 tends to make more claims overall, so it gets more right and more wrong. Which is technically true and also kind of beside the point if you're a user who can't tell which claims are which. TechCrunch reported that third-party testing by Transluce, a nonprofit AI research lab, found o3 fabricating entire processes, including claiming it ran code on a 2021 MacBook Pro "outside of ChatGPT." It can't do that. It just said it did.
The fix that would kill the product
The paper does propose a solution. Set a confidence threshold. If the model isn't sure enough, have it say "I don't know" instead of guessing. The math works out cleanly: under the right threshold, hallucinations drop because the model stops bluffing on questions it can't answer.
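A minimal sketch of that mechanism, assuming the model exposes some confidence estimate for its candidate answer. The function and threshold here are illustrative, not OpenAI's API or the paper's exact construction:

```python
def answer_or_abstain(candidate: str, confidence: float, threshold: float = 0.75) -> str:
    """Return the candidate answer only if the model's confidence clears
    the threshold; otherwise abstain. Hallucinations drop because
    low-confidence bluffs are filtered out, at the cost of coverage."""
    if confidence >= threshold:
        return candidate
    return "I don't know"

print(answer_or_abstain("Paris", 0.98))             # confident: answer
print(answer_or_abstain("a 2021 MacBook Pro", 0.40))  # unsure: abstain
```

The threshold is the whole trade-off in one parameter: raise it and hallucinations fall while the share of unanswered queries climbs.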
But the paper's own analysis suggests that roughly 30% of queries would go unanswered. Picture that. You ask ChatGPT something and three times out of ten it just... shrugs. As Wei Xing, an assistant professor at the University of Sheffield, wrote in The Conversation, users accustomed to confident answers would abandon such a system almost immediately. He drew a comparison to an air quality monitoring project in Salt Lake City where flagging measurement uncertainty led to lower user engagement, even though the uncertain readings were more honest than the confident ones.
That's the bind. The solution exists. It provably reduces hallucinations. And it would crater engagement metrics so fast that no consumer AI company would ship it.
Three labs, same wall
OpenAI isn't the only group reaching this conclusion, which makes it harder to dismiss as one company's internal pessimism. Researchers at Tsinghua University's NLP lab published work in December 2025 (arXiv: 2512.01797) identifying specific neurons, fewer than 0.01% of the total, that drive hallucination behavior. They called them H-Neurons. The finding: these neurons don't store wrong information. They encode the preference for producing any answer over admitting uncertainty. Suppress them, and hallucination drops. But so does the model's willingness to answer at all.
Separately, Oxford researchers published in Nature in 2024 showing that models produce arbitrary wrong answers when uncertain, and proposed semantic entropy as a detection method. Sebastian Farquhar, the lead author (who later joined Google DeepMind), was careful to note this only catches one type of error: confabulation, where the model gives different answers each time you ask. Consistent mistakes slip through.
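The Oxford method, reduced to a toy version: sample the same question several times, group answers that mean the same thing, and compute the entropy over the groups. The real method clusters answers by bidirectional entailment using a language model; the exact-match clustering below is a deliberate simplification:

```python
import math
from collections import Counter

def semantic_entropy(samples: list[str]) -> float:
    """Entropy over answer clusters. Here clusters are exact-match groups
    after normalization; Farquhar et al. cluster by meaning instead,
    which this toy version does not capture."""
    clusters = Counter(s.strip().lower() for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in clusters.values())

# Confabulation: the model answers differently each time -> high entropy.
print(semantic_entropy(["1912", "1915", "1909", "1912"]))
# A consistently wrong answer -> zero entropy, and the method misses it.
print(semantic_entropy(["1915", "1915", "1915", "1915"]))
```

The second call is Farquhar's caveat made concrete: a model that repeats the same wrong answer every time looks exactly like a model that knows the right one.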
So we have one group proving hallucinations are mathematically inevitable. Another locating the exact neurons responsible. A third developing detection methods while acknowledging they only catch a fraction of the problem. The convergence is hard to ignore.
What the benchmarks actually reward
I keep coming back to the benchmark finding because it's the most actionable part of this whole story. Nine out of ten major AI evaluations, including GPQA, MMLU-Pro, and SWE-bench, grade in a way that makes honesty a losing strategy. The OpenAI paper proposes reforming these benchmarks to reward calibrated confidence, penalizing wrong answers more heavily (say, minus three points) while allowing abstention at zero cost.
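Under that scoring rule, the breakeven confidence falls out of simple expected value: answering beats abstaining only when the chance of a +1 outweighs the risk of the penalty. A sketch, using the minus-three penalty from the paper's example:

```python
def breakeven_confidence(wrong_penalty: float) -> float:
    """Minimum probability of being correct at which answering beats
    abstaining, given +1 for right, -wrong_penalty for wrong, 0 to abstain.
    Solve p * 1 - (1 - p) * wrong_penalty = 0 for p."""
    return wrong_penalty / (1.0 + wrong_penalty)

print(breakeven_confidence(0.0))  # binary grading: any confidence above zero -> guess
print(breakeven_confidence(3.0))  # minus-three penalty: abstain below 75% confidence
```

That is why the penalty matters: it is the only knob that moves the rational strategy away from "always guess."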
On paper, this works. In practice, who moves first? If one company starts training models that say "I don't know" more often, their benchmark scores drop relative to competitors who keep training aggressive guessers. The OpenAI blog post accompanying the paper tries to reframe this optimistically, arguing that hallucinations aren't inevitable if models can abstain. But that framing quietly concedes the point: the only way to avoid hallucinating is to not answer, and not answering isn't what anyone is paying for.
Neil Shah, VP for research at Counterpoint Research, put it bluntly in Computerworld's coverage: unlike human intelligence, AI lacks the humility to acknowledge uncertainty. It presents estimates as facts. And we've built the entire evaluation ecosystem to encourage exactly that behavior.
The FTC hasn't weighed in on hallucination specifically, and there's no regulatory deadline forcing anyone's hand. OpenAI's paper calls for industry-wide evaluation reform. The paper itself is the most concrete step anyone has taken toward that, and it's a research paper, not a product change. The models shipping today still guess. The benchmarks still reward it. And nearly half of what o4-mini says about real people is wrong.