OpenAI publishes guide on how engineers use Codex

OpenAI has published a 12-page PDF documenting how its internal teams use Codex, the company's AI coding agent. The document covers seven use cases and six best practices, illustrated with about a dozen anonymous quotes from engineers on the security, infrastructure, frontend, API, and performance teams.

It's a useful document. It's also, pretty clearly, a marketing piece dressed as a case study.

The pitch

Each use case follows the same template: short description, two or three anecdotes from unnamed engineers, three sample prompts. The seven categories are code understanding, refactoring and migrations, performance optimization, test coverage, development velocity, staying in flow, and exploration. There's no real narrative connecting them. Each section reads like it was written separately and stapled together.

One quote in particular has been circulating since the document started getting attention. A Product Engineer on the ChatGPT Enterprise team is quoted saying they were "in meetings all day and still merged 4 PRs" with Codex running tasks in the background.

Look, I get why this lands. It is the dream pitch for any developer tool: do less, ship more. But four PRs by an engineer who was in meetings all day raises questions the document doesn't answer. What size were the PRs? Who reviewed them? Were they merged because they were good, or because someone needed to clear the queue?

What's actually in there

The use cases themselves are unsurprising if you have used any of these tools. Engineers paste stack traces and ask where the auth flow lives. They use Codex to swap a legacy function across dozens of files. They point it at low-coverage modules overnight and review the resulting test PRs the next morning (or, more realistically, skim them). Standard agent-coding stuff.

The best practices are the better half. OpenAI recommends starting in Ask mode for large changes to get an implementation plan, then switching to Code mode to execute. Prompts should look like GitHub issues, with file paths and references to other modules. There's guidance on maintaining an AGENTS.md file with naming conventions, business logic, and quirks the model can't infer from code alone. And there's Best-of-N, which generates multiple responses to the same prompt so you can pick or splice the best ones. This is accumulated tribal knowledge, and it's worth more than the testimonials.
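
For readers who haven't met Best-of-N before, here is a minimal sketch of the general idea in Python. It is not how Codex implements the feature, which the PDF doesn't describe. The `generate` and `score` callables are hypothetical stand-ins: in practice "generate" would be a model call and "score" would be a human reviewer, a test run, or some other judgment.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], str],  # hypothetical: returns one candidate per seed
    score: Callable[[str], float],        # hypothetical: higher means better
    n: int = 4,
) -> Tuple[str, List[Tuple[float, str]]]:
    """Generate n candidates for one prompt and return the best plus the full ranking."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Same prompt every time; only the seed varies between attempts.
    candidates = [generate(prompt, seed) for seed in range(n)]
    # Rank by score, best first. sorted() is stable, so ties keep generation order.
    ranked = sorted(
        ((score(c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return ranked[0][1], ranked


# Toy stand-ins so the sketch runs without any model behind it.
def fake_generate(prompt: str, seed: int) -> str:
    return f"{prompt} / variant {seed}" + "!" * seed


def fake_score(candidate: str) -> float:
    return float(candidate.count("!"))
```

Returning the full ranking, not just the winner, matters for the "splice" half of the advice: a reviewer may want pieces of the runner-up too.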

What's missing

No real metrics. None.

The document claims engineers can save half an hour of work for five minutes of prompting, that PRs get merged in the background, that tests get written overnight. There are no numbers on how often Codex's PRs get rejected, how much review time they actually consume, what fraction of generated tests get scrapped, or how failure rates compare across teams. The introduction promises insights drawn from engineer interviews and internal usage data. The rest of the document doesn't deliver on the second half.
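
None of these are hard to compute if you have the PR records. As a rough sketch of what the missing numbers would look like, assuming a hypothetical `AgentPR` record (the field names are mine, not OpenAI's, and nothing like this appears in the PDF):

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class AgentPR:
    """Hypothetical record for one agent-authored pull request."""
    merged: bool           # False means closed without merging
    review_minutes: float  # human time spent reviewing
    tests_generated: int   # tests the agent wrote in this PR
    tests_kept: int        # tests that survived review


def summarize(prs: List[AgentPR]) -> Dict[str, float]:
    """Compute the metrics the document leaves out."""
    if not prs:
        raise ValueError("need at least one PR record")
    rejected = sum(1 for pr in prs if not pr.merged)
    generated = sum(pr.tests_generated for pr in prs)
    kept = sum(pr.tests_kept for pr in prs)
    return {
        "rejection_rate": rejected / len(prs),
        "median_review_minutes": median(pr.review_minutes for pr in prs),
        # Fraction of generated tests thrown away; 0.0 if no tests were generated.
        "test_scrap_rate": (generated - kept) / generated if generated else 0.0,
    }
```

Grouping the same records by team before calling `summarize` would give the cross-team comparison as well.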

Compare that with OpenAI's harness engineering post, where a team of seven engineers shipped roughly 1,500 PRs over five months on a million-line codebase, with no human-written code at all. That's an actual number. It's also a number that should make you a little nervous about what's getting reviewed and how.
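
To see why, run the arithmetic. The sketch below takes the three figures from the post and adds one assumption of my own, 21 working days per month, which is not in either document.

```python
from typing import Dict


def pr_throughput(
    total_prs: int,
    engineers: int,
    months: int,
    working_days_per_month: int = 21,  # assumption, not from OpenAI's post
) -> Dict[str, float]:
    """Break a headline PR count down into per-engineer and per-day rates."""
    per_engineer_per_month = total_prs / engineers / months
    team_per_month = total_prs / months
    return {
        "per_engineer_per_month": per_engineer_per_month,
        "per_engineer_per_day": per_engineer_per_month / working_days_per_month,
        "team_per_day": team_per_month / working_days_per_month,
    }


rates = pr_throughput(total_prs=1500, engineers=7, months=5)
for name, value in rates.items():
    print(f"{name}: {value:.1f}")
# per_engineer_per_month: 42.9
# per_engineer_per_day: 2.0
# team_per_day: 14.3
```

Under that assumption, that's about two PRs per engineer per working day, every day, for five months, and roughly fourteen a day landing on the team as a whole.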

Who reads all these PRs?

The PDF doesn't engage with the obvious question: if everyone is firing off Codex tasks in the background, who is reviewing the resulting PRs? Writing about Codex's internal use at OpenAI, Gergely Orosz noted that "the traditional PR flow is starting to crack" under the volume. That observation isn't in OpenAI's document. It probably should be.

There's a related tension in use case six, the one about staying productive when calendars are packed with interruptions. The framing treats meetings as immovable, so engineers fire off background tasks to compensate. You can read this two ways. Either Codex helps engineers reclaim time eaten by calendar chaos, or it's the tool that makes the calendar chaos sustainable by quietly absorbing the cost of context-switching. The document is firmly in camp one. Camp two is probably more honest.

What it's actually for

This document isn't aimed at journalists or skeptics. It's aimed at engineering leads at potential enterprise customers, with the production values of a press kit and the structure of a procurement justification. Use cases map to budget categories. Best practices give buyers something to point at when defending a rollout. Anonymous testimonials provide air cover.

It's still worth reading. Just don't confuse it with data.