Within six days of each other, Mitchell Hashimoto and a team at OpenAI published remarkably similar ideas about what software engineering looks like when AI agents write all the code. Both landed on the same term: harness engineering.
Hashimoto, the co-founder of HashiCorp and creator of Ghostty, described it in a blog post on February 5th. OpenAI's Ryan Lopopolo followed on February 11th with a longer writeup about building an entire product where humans never directly contributed any code. The convergence wasn't coordinated; it's the kind of thing that happens when enough practitioners hit the same wall at the same time.
What Hashimoto actually means
The core idea is deceptively simple: every time an AI agent makes a mistake, you invest the effort to ensure it never makes that mistake again. Not by yelling at the model. By changing the environment around it.
Hashimoto breaks this into two mechanisms. First, implicit prompting through files like AGENTS.md, which are essentially rule books that live in your repo. He published Ghostty's version as an example, and each line in it traces back to a specific agent failure. Second, actual programmed tools: scripts that take screenshots, run filtered tests, verify behavior. The agent gets told these tools exist, and it uses them.
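To make the second mechanism concrete, here is a minimal sketch of the kind of harness tool Hashimoto describes: a helper that maps changed source files to just the tests that cover them, so an agent gets fast, targeted feedback instead of a full-suite run. This is an illustration, not code from Hashimoto's harness; the naming convention (a `test_` prefix mirroring the source module name) is an assumption.

```python
from pathlib import Path

def filtered_tests(changed_files: list[str], all_tests: list[str]) -> list[str]:
    """Select only the test files whose names match a changed source
    module, assuming a hypothetical test_<module>.py convention."""
    changed_stems = {Path(f).stem for f in changed_files}
    return [
        t for t in all_tests
        if Path(t).stem.removeprefix("test_") in changed_stems
    ]
```

A tool like this gets mentioned in the AGENTS.md rule book ("run the filtered tests before claiming a task is done"), closing the loop between the two mechanisms.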
One commenter on Hacker News compared the approach to building up immunity through vaccinations. That's not a bad analogy. You encounter a bug in agent behavior, you write the antibody, and the system gets a little more resistant each time.
But here's the thing Hashimoto is honest about: he's only running background agents maybe 10 to 20 percent of a normal working day. The "always have an agent running" goal is still aspirational for him. And he's working on Ghostty, a terminal emulator, not exactly the most complex software domain. How does this scale?
OpenAI went further. A lot further.
OpenAI's experiment is more aggressive. A team of three engineers (later seven) used Codex to generate roughly a million lines of code across 1,500 pull requests over five months. The constraint they set for themselves: no manually written code. Zero. Humans would steer through prompts, review pull requests, and build the scaffolding that made agent work reliable. But they would not write code.
The claimed throughput is 3.5 PRs per engineer per day, and OpenAI says the team built in about a tenth of the time it would have taken to write the code by hand. Those numbers deserve some scrutiny. A million lines of code is easy to generate when you count documentation, tooling, tests, and configuration (which they explicitly do). And comparing against hypothetical manual timelines is always a bit self-serving. Still, the product apparently has internal daily users and external alpha testers.
What caught my attention was the honesty about early failures. Progress was slower than expected at the start, OpenAI admits, not because Codex couldn't code, but because the environment was underspecified. The agent lacked the abstractions and feedback loops it needed to make progress toward high-level goals. Sound familiar? It's Hashimoto's harness engineering, but at company scale.
Repository as the only source of truth
One architectural choice stands out in the OpenAI writeup. Because the agent can only see what's in the repo, the team had to push everything into versioned, repository-local artifacts. Slack discussions about architectural decisions? Useless to the agent. Google Docs with design specs? Invisible. If a piece of knowledge didn't exist in the repo as markdown, a schema, or executable code, it effectively didn't exist at all.
They even enforced architectural constraints through custom linters (themselves generated by Codex) that prevent code from depending on the wrong layers. This is the kind of rigid structure you normally postpone until you have hundreds of engineers. With agents, it's a prerequisite.
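A layer-dependency linter of this kind can be surprisingly small. The sketch below is not OpenAI's actual linter; the layer names and the module-path convention (`layer/module`) are assumptions. The rule it encodes is the standard one: modules may only import from their own layer or layers below.

```python
# Higher numbers may import lower numbers, never the reverse.
LAYERS = {"ui": 2, "service": 1, "core": 0}

def violations(imports: dict[str, set[str]]) -> list[str]:
    """Given a map of "layer/module" -> imported modules, return an
    error string for every import that points to a higher layer."""
    errors = []
    for module, deps in imports.items():
        src = module.split("/")[0]
        for dep in deps:
            dst = dep.split("/")[0]
            if LAYERS[dst] > LAYERS[src]:
                errors.append(f"{module} may not import {dep}: "
                              f"'{dst}' sits above '{src}'")
    return errors
```

Wired into CI, a check like this turns an architectural convention that would otherwise live in someone's head into a hard, repo-local constraint the agent cannot violate unnoticed.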
The skill formation problem nobody wants to talk about
Hashimoto, to his credit, directly addresses something most AI-coding evangelists skip. He links to an Anthropic research paper that found developers using AI assistance scored 17% lower on comprehension quizzes than those who coded by hand. That's nearly two letter grades, according to the researchers.
His proposed solution is splitting: delegate the boring tasks to agents while continuing to do manual deep work on the things you care about. "You're trading off," he writes, "not forming skills for the tasks you're delegating to the agent while continuing to form skills naturally in the tasks you continue to work on manually." It is a reasonable framework for a senior engineer who already has decades of fundamentals. For someone two years into their career? I'm less sure.
And OpenAI's model, where engineers literally never write code, pushes this tension further. If your job becomes designing environments and writing prompts and building feedback loops, are you still a software engineer? Or are you something else entirely that we don't have a name for yet?
The Fortune article about the recent GPT-5.3-Codex launch captured the mood. Spotify's co-CEO said their best developers haven't written a single line of code since December. Anthropic reported 70 to 90 percent of their own code is now AI-generated. Boris Cherny, who heads Claude Code at Anthropic, said he hasn't personally written code in over two months.
Two very different entry points, same conclusion
What makes the Hashimoto-OpenAI convergence interesting isn't just the shared terminology. It's that they arrived from opposite directions. Hashimoto is a self-described skeptic who forced himself through the painful early phases of adoption, doing his work twice (once manually, once with an agent) until he developed intuition for what agents were good at. OpenAI's team started with the radical constraint that humans would never touch the code, and then figured out what infrastructure was needed to make that work.
Both ended up in the same place: the engineer's job is to build the harness, not to write the code. The harness is the AGENTS.md files, the linters, the architectural constraints, the verification scripts, the test infrastructure. The code itself is just output.
Hashimoto acknowledged he might look back at this post and laugh at his own naivete. Given that OpenAI is already operating at a scale he hasn't attempted, that might happen sooner than he thinks. But his step-by-step documentation of the journey, from skeptic to cautious adopter to harness engineer, is more useful than OpenAI's polished writeup precisely because it includes the failures and the friction.
The question now isn't whether harness engineering is a real discipline. It clearly is. The question is whether the industry can train people for it fast enough, given that the traditional path of learning-by-coding may be partially foreclosed by the tools themselves. The Anthropic paper suggests this is not a hypothetical concern.
OpenAI's Codex team is growing. Hashimoto is building out his harness for Ghostty. Neither has published anything resembling a curriculum for how you actually get good at this. For now, the answer seems to be: stumble through it, document what breaks, and build the antibodies one failure at a time.




