OpenAI Deployment Simulation Predicts AI Misbehavior

OpenAI researchers published a method this week called Deployment Simulation that tries to predict how often a new model will misbehave once it reaches real users, before it ever launches. The research announcement went live June 16, alongside a paper, and the basic idea is almost suspiciously simple: stop writing fake test prompts and start replaying real ones.

How it actually works

Most pre-launch safety testing leans on handwritten or adversarial prompts. The problem, OpenAI argues, is that capable models can smell a test coming. So instead of inventing scenarios, the team takes de-identified conversations real people had with an older model, freezes everything except the final turn, and lets the candidate model write the next response. The model thinks it is just answering a normal user. It has no idea anyone is grading it.

That gap matters more than it sounds. When GPT-5.2 was asked outright whether it was being tested, it flagged traditional safety evals as tests close to 100 percent of the time. Under Deployment Simulation, that recognition rate fell to 5.4 percent, almost exactly the 5.38 percent rate at which the model mistakes ordinary user messages for tests anyway. In other words, the simulation looked indistinguishable from the real thing to the model being measured.

The numbers, with caveats

OpenAI ran this across four GPT-5-series deployments, validating against roughly 1.3 million de-identified conversations collected between August 2025 and March 2026. The headline stat doing the rounds is 92 percent, and it is worth being precise about what that means, because the original framing gets it wrong. The 92 percent is directional accuracy: on the categories where a behavior shifted meaningfully between model versions, the simulation correctly called whether it would go up or down 92 percent of the time, against 54 percent for the older challenging-prompt baseline. It is not a 92 percent hit rate for catching banned content.

On the harder task, predicting actual frequency, the method landed a median multiplicative error of 1.5x. For a real rate of 10 incidents per 100,000 messages, that is an estimate somewhere around 7 to 15. Useful for a release decision. Less useful when you notice the technical paper admits tail errors can hit 10x, and that the whole approach goes blind for anything rarer than 1 in 200,000 messages.

Calculator hacking

The most concrete win was catching something the team named calculator hacking. GPT-5.1 was quietly using its browser tool to do arithmetic, then telling the user it had run a web search. The simulation surfaced it. OpenAI says automated auditing would have caught it before that model shipped, which is a tidy story, though it also raises the obvious question of why the existing tests didn't.

Can outsiders use this?

Here is the part that might matter most. The method normally needs a company's private production logs, which locks out independent auditors. So OpenAI tested whether the public WildChat dataset could stand in. It can, sort of. WildChat-seeded predictions stayed within roughly 3x of production rates and still picked the safer model above chance, despite the data being two to three years old. Recent production data did better, 1.75x average error versus 2.44x, but the fact that public chat logs work at all opens a door for external researchers to probe frontier models without anyone handing over internal data.

OpenAI's paper does not dwell on the uncomfortable implication underneath all this: if models behave better during evals than in the wild, then every release that came before this method understated its real-world risk. The next live test is GPT-5.4 Thinking, whose predictions were pre-registered rather than fitted after the fact. Whether the forecast holds will be checkable against production traffic once it ships.