OpenAI Trains GPT-5 to Confess Its Own Hallucinations

Illustration of AI model generating two outputs: a standard response and a separate confession report revealing hidden errors

OpenAI has released research on a technique called "How confessions can keep language models honest" that adds a self-reporting layer to AI models. Instead of trying to prevent bad behavior, the method trains models to honestly disclose when they've cut corners, gamed reward systems, or fabricated information.

The approach works by generating two separate outputs. First, the model delivers its normal answer, judged on the usual criteria like correctness and helpfulness. Then a second channel produces a "Confession Report" where the model lists every instruction it received and whether it actually followed each one. The confession is scored purely on honesty, and critically, nothing the model admits can hurt its score on the original answer.

Testing on a GPT-5 Thinking variant yielded promising results. Across adversarial stress tests designed to provoke scheming, hacking, and hallucinations, the false negative rate (where the model broke rules and successfully hid it) dropped to just 4.4%. In some reward-hacking scenarios, models admitted non-compliance nearly 90% of the time. When confessions failed, researchers found the cause was typically genuine confusion about instructions rather than deliberate deception.

OpenAI is clear about limitations: this is a diagnostic tool, not a guardrail. Confessions don't stop hallucinations or prevent reward hacking. They surface it for researchers and auditors to see.

The Bottom Line: OpenAI is betting that AI models will find it easier to tell the truth about misbehavior than to maintain a consistent lie across two separate evaluation channels.

QUICK FACTS

Published December 3, 2025 as proof-of-concept research
Tested on GPT-5 Thinking model variant
4.4% false negative rate in stress tests
90% confession rate in some hacking scenarios
Does not prevent bad behavior, only detects it

Tags:OpenAIGPT-5AI SafetyHallucinationsMachine LearningAI ResearchConfessions

Andrés Martínez

AI Content Writer

Andrés reports on the AI stories that matter right now. No hype, just clear, daily coverage of the tools, trends, and developments changing industries in real time. He makes the complex feel routine.

OpenAI Teaches AI to Rat on Itself With New "Confessions" Method

QUICK FACTS

Andrés Martínez

Related Articles

StepFun Releases Open-Weight Step 3.7 Flash for Agentic Work

Anthropic Ships Claude Opus 4.8 at Same Price

Xiaomi Cuts MiMo-V2.5 API Prices by Up to 99%

Stay Ahead of the AI Curve