OpenAI published extensive documentation on the inner workings of Codex CLI, its open-source terminal-based coding agent. The technical guide lays out the agent loop mechanics, explains why prompt caching is fragile, and introduces a compaction system that lets agents run for hours without hitting context limits.
The Agent Loop Is Simple Until It Isn't
At its core, Codex CLI runs what OpenAI calls an agent loop. You send a request, the system assembles a prompt with instructions pulled from AGENTS.md files scattered through your project directories, and the model responds with either text or a tool call. If it's a tool call, the CLI executes it locally, stuffs the result back into the prompt, and hits the API again.
This continues until the model decides the task is done.
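Here's what that loop looks like in a minimal sketch against the Responses API. The client calls are real; the model name, the prompt assembly, and the local tool executor are illustrative stand-ins, not Codex CLI's actual internals.

```python
# Minimal agent loop sketch. The OpenAI client calls are real; the model name
# and execute_tool_call() are illustrative, not Codex CLI's internals.
import json
import subprocess
from openai import OpenAI

client = OpenAI()

def execute_tool_call(call) -> str:
    # Hypothetical local executor for a single "shell" tool.
    args = json.loads(call.arguments)
    result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def run_agent(task: str, instructions: str, tools: list) -> str:
    history = [{"role": "user", "content": task}]  # state lives on the client
    while True:
        response = client.responses.create(
            model="gpt-5",              # illustrative model name
            instructions=instructions,  # aggregated AGENTS.md guidance
            tools=tools,
            input=history,
        )
        history += response.output  # keep the model's items in the transcript
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            return response.output_text  # no tool call: the model is done
        for call in calls:
            history.append({
                "type": "function_call_output",
                "call_id": call.call_id,
                "output": execute_tool_call(call),
            })
```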
The catch: all those project instructions from AGENTS.md files get aggregated into a single JSON blob capped at 32KB. That's it. More instructions won't fit, so you'd better be economical with your guidance.
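For a feel of how tight that budget is, here's a rough sketch of the aggregation, assuming a simple walk-and-concatenate approach; the actual ordering and cut-off behavior aren't spelled out in the guide.

```python
# Rough sketch of the 32 KB instruction budget. The walk order, formatting,
# and cut-off policy here are assumptions, not Codex CLI's documented behavior.
from pathlib import Path

MAX_INSTRUCTIONS_BYTES = 32 * 1024

def collect_agents_md(root: Path) -> str:
    chunks, total = [], 0
    for path in sorted(root.rglob("AGENTS.md")):
        entry = f"\n# {path.relative_to(root)}\n" + path.read_text(encoding="utf-8")
        size = len(entry.encode("utf-8"))
        if total + size > MAX_INSTRUCTIONS_BYTES:
            break  # past the cap, extra guidance simply doesn't make it in
        chunks.append(entry)
        total += size
    return "".join(chunks)
```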
Prompt Caching Isn't Magic
Here's where it gets interesting for anyone building their own agents. The conversation history grows with every exchange, and you're resending that entire history on every API call. Without intervention, costs scale quadratically with conversation length.
OpenAI uses prompt caching to avoid this. If your prompt prefix stays byte-identical between requests, you get a cache hit and costs scale roughly linearly instead. Sounds reasonable.
But the cache is brittle. Change the order of tools in your list. Tweak your sandbox configuration. Adjust a single instruction. The prefix no longer matches, you get a cache miss, and you're back to paying to reprocess the full context.
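If you're building your own harness, the practical move is to canonicalize everything that sits in the prefix. A sketch, assuming a simple sort-and-serialize approach:

```python
# Keep the cached prefix byte-identical across requests by canonicalizing the
# tool list and keeping anything dynamic out of it. This helper is a harness
# design suggestion, not an OpenAI API.
import json

def canonical_tools(tools: list[dict]) -> list[dict]:
    # Stable ordering and key layout so the serialized prefix never shifts.
    return [
        json.loads(json.dumps(t, sort_keys=True))
        for t in sorted(tools, key=lambda t: t.get("name", ""))
    ]

# Timestamps, per-turn hints, and other dynamic content belong after the
# stable prefix: append them to the conversation history instead.
```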
The configuration reference includes an auto_compact_token_limit setting that triggers automatic compaction when the conversation approaches the context window. But if you're running Codex with custom configurations that differ between sessions, expect cache hit rates to suffer.
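In config terms that looks something like this, with the key name taken from the configuration reference and the value purely illustrative:

```toml
# ~/.codex/config.toml -- value is illustrative
auto_compact_token_limit = 200000
```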
The Stateless Shift
OpenAI's Responses API moved to a fully stateless model. The previous_response_id parameter that let you chain conversations by reference is gone for Zero Data Retention configurations.
For organizations that can't store anything on OpenAI's servers (healthcare, finance, certain enterprise setups), this creates a problem. How do you maintain conversation context across turns when the server won't remember anything?
The answer is encrypted reasoning content. When you include reasoning.encrypted_content in your API request's include field, OpenAI returns an encrypted blob representing the model's reasoning state. You store it client-side, pass it back with the next request, and the server decrypts it in memory, uses it, then discards it. No persistence, but you keep the context.
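In practice the pattern looks roughly like this: store set to False, reasoning.encrypted_content in the include list, and the client replaying prior output items on the next turn. Model name and prompts are illustrative.

```python
# Sketch of the ZDR pattern: nothing is stored server-side (store=False), and
# the encrypted reasoning items are carried forward by the client.
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-5",
    input=[{"role": "user", "content": "Refactor utils.py to remove dead code."}],
    store=False,
    include=["reasoning.encrypted_content"],
)

# Next turn: replay the prior output items (including the encrypted reasoning
# blobs) along with the new user message. The server decrypts them in memory,
# uses them, and discards them.
followup = client.responses.create(
    model="gpt-5",
    input=list(first.output) + [{"role": "user", "content": "Now add type hints."}],
    store=False,
    include=["reasoning.encrypted_content"],
)
print(followup.output_text)
```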
It's clever engineering for a real compliance need. It's also more complexity pushed to the client.
Compaction Is the Real Story
The compaction system is what makes multi-hour coding sessions possible. Previously, developers handled this with ad-hoc summarization, asking the model to condense the conversation so far and using that summary going forward. Results varied.
OpenAI's implementation works differently. When the context window fills up, you call the /responses/compact endpoint. The model generates a compressed representation of the conversation, but rather than a human-readable summary, it's a structured item you pass back into future requests. The model was trained to understand these compaction items, so it can reconstruct relevant context from them.
The changelog notes that GPT-5.2-Codex specifically includes improvements for "long-horizon work through context compaction." The model isn't just tolerating compacted context; it was trained for it.
This happens automatically when you hit the auto_compact_token_limit threshold. The old /compact slash command still exists for manual triggers, but the system no longer depends on users remembering to invoke it.
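The guide names the endpoint, but the full request schema isn't laid out, so treat this as a sketch of the shape rather than working code: send the overlong item list, get back a compaction item, carry that forward in place of the history it replaces.

```python
# Hedged sketch only: the /responses/compact endpoint is named in the guide,
# but the payload and response shape below are assumptions.
import os
import httpx

def compact_history(items: list[dict]) -> list[dict]:
    resp = httpx.post(
        "https://api.openai.com/v1/responses/compact",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": "gpt-5", "input": items},  # assumed payload shape
        timeout=120,
    )
    resp.raise_for_status()
    # Assumed: the response contains structured compaction item(s) that later
    # requests accept in place of the original conversation history.
    return resp.json()["output"]
```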
Local Models via Ollama
One detail buried in the command line options: the --oss flag connects Codex CLI to local models through Ollama or LM Studio. The same agent loop, the same tool calling infrastructure, but running inference locally.
Whether local models can match the trained behaviors around compaction and tool use is another question. The prompting guide explicitly notes that the model was "primarily trained using a dedicated terminal tool," suggesting the exact tool definitions matter for performance. Swap in a different model with different training, and your mileage may vary significantly.
What This Means for Agent Builders
If you're building coding agents, a few things stand out. The prompt caching sensitivity means your agent harness needs to be very consistent about prompt construction. Random reordering breaks the cache. Dynamic instruction insertion breaks the cache.
The compaction system is available through the Responses API, not just Codex CLI. The API reference documents the endpoint. Anyone building long-running agents can use it.
And the ZDR-compatible encrypted content pattern is worth studying if you're in a regulated industry. Client-side state with server-side decryption avoids the usual tradeoff between compliance and capability.
The GitHub repository for Codex CLI is Apache-2.0 licensed. OpenAI suggests using Codex itself to explore how things are implemented, which is either charmingly recursive or a sign that the codebase has gotten complex enough that even they recommend an AI assistant for navigation.