QUICK INFO
| Difficulty | Intermediate |
| Time Required | 2-4 hours for initial setup, ongoing refinement |
| Prerequisites | Experience running AI coding agents on real, multi-file tasks |
| Tools Needed | Codex, Claude Code, or a comparable coding agent |
What You'll Learn:
- How to structure a repository so AI agents can reason about it without human help
- How to write an AGENTS.md that works as a map rather than an encyclopedia
- How to set up automated review loops that replace most manual code review
- How to manage tech debt in a codebase where agents generate all the code
This guide covers how to restructure your engineering workflow so AI coding agents do the actual code generation while humans focus on architecture, intent, and feedback loops. It draws heavily from OpenAI's Harness article, where a small team shipped roughly a million lines of agent-generated code in five months with zero manually written code. The approach applies whether you use Codex, Claude Code, or another agent, though the specifics here lean toward the Codex ecosystem.
You need some experience running AI coding agents on real tasks, not just toy prompts. If you've never used Codex CLI or a comparable tool on a multi-file project, start there first.
What Changed About the Engineering Role
The OpenAI Harness team's core claim is that engineers stopped writing code entirely. Their job became designing environments, specifying intent, and building the feedback loops that let agents produce reliable output. On a team of seven engineers generating around 3.5 merged PRs per person per day, nobody touched application code directly.
The practical shift: when something broke, the response was never "try the prompt again." Instead, the question was "what capability is the agent missing, and how do we make the system legible enough that the agent can succeed next time?" That distinction matters. You're not debugging prompts. You're debugging the environment the agent operates in.
I should clarify that this level of agent autonomy took months to reach. They started with simple building blocks and gradually increased complexity. Trying to hand an agent full autonomy on day one will produce garbage. The progression from simple to complex is non-negotiable.
Getting Started: Make Your Repository Agent-Legible
The single biggest insight from the Harness project: from the agent's perspective, anything it can't access in-context doesn't exist. That Slack thread where your team decided on an architectural pattern? If it's not in the repo, the agent will never know about it. That design decision from a video call? Same problem.
Step 1: Audit What Lives Outside Your Repo
Go through your team's last two weeks of decisions. Check Slack, email, meeting notes, Notion, Google Docs. Any technical decision, convention, or constraint that isn't committed to the repository is invisible to your agents. Make a list. This is your migration backlog.
Step 2: Create a docs/ Directory Structure
Don't dump everything into one file. The Harness team found that a structured docs/ directory works far better than a monolithic document. At minimum, create these:
docs/
  architecture.md    # Domain map, package layering, boundaries
  conventions.md     # How code is written here (style, patterns, naming)
  decisions/         # Architecture Decision Records (ADRs)
  quality.md         # Grades each domain/layer, tracks known gaps
The architecture.md file should describe your system's domains and how they relate. Not a novel. A map. Think two pages maximum that an agent can scan to understand where a new feature should live and what it should depend on.
Expected result: When you point an agent at a task like "add rate limiting to the API gateway," it should be able to find the relevant domain, understand the package boundaries, and know which patterns to follow, all from what's in the repo.
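To keep this structure from silently eroding, you can enforce its existence mechanically. Here's a minimal sketch, assuming the directory layout suggested above; the function name and path list are mine, so adjust them to your repo:

```python
from pathlib import Path

# Expected agent-facing docs scaffold (mirrors the structure above).
EXPECTED = [
    "docs/architecture.md",
    "docs/conventions.md",
    "docs/decisions",
    "docs/quality.md",
]

def missing_docs(repo_root: str) -> list[str]:
    """Return the expected docs paths that are absent from the repo."""
    root = Path(repo_root)
    return [p for p in EXPECTED if not (root / p).exists()]
```

Run this as a CI check so the scaffold is a hard requirement, not a convention that drifts.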
Step 3: Write Your AGENTS.md (Keep It Short)
The Harness team's first approach was a big AGENTS.md file packed with every instruction they could think of. It failed. The problem is straightforward: context is a finite resource. A massive instruction file crowds out the actual task, the relevant code, and the documentation the agent needs to reference.
Keep your root AGENTS.md to roughly 100 lines. It should function as a table of contents, not an encyclopedia. Point to your docs/ directory for the details. Here's the structure that works:
# AGENTS.md
## Project Overview
One paragraph describing what this system does.
## Quick Start
Build: `make build`
Test: `make test`
Lint: `make lint`
## Architecture
See docs/architecture.md for domain map and package layering.
## Conventions
See docs/conventions.md for code style and patterns.
## Key Constraints
- No new dependencies without updating docs/decisions/
- All public APIs require integration tests
- Database migrations must be backward-compatible
For monorepos, nest additional AGENTS.md files in subdirectories. Codex (and most agents) read the nearest file in the directory tree, so subdirectory files can override or extend the root instructions.
Setting Up Automated Enforcement
This is where the Harness approach diverges from how most teams think about AI code generation. Their philosophy: architectural standards and code conventions are enforced mechanically, not through human review. Linters, CI checks, and automated review catch bad patterns. Humans intervene by exception, not by default.
The reasoning is practical. When agents produce 3+ PRs per engineer per day, manual review of every PR is a bottleneck that kills the throughput advantage. But without enforcement, agents will replicate whatever patterns exist in the codebase, including the bad ones. Once a suboptimal pattern gets committed, it becomes a template that future agent runs will copy.
Step 4: Configure Your Linter as an Architectural Boundary
Your linter isn't just for style. Use it to encode architectural rules. If package A should never import from package B, make that a lint error. If certain functions should only be called from specific modules, encode that too. The Harness team's rule was: when documentation falls short, promote the rule into code.
In practice, this means your linter config grows significantly. That's fine. Each rule is a guardrail that prevents an agent from introducing a pattern you'll spend hours cleaning up later.
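Dedicated tools exist for this (import-linter for Python, ESLint's `no-restricted-imports`, ArchUnit for Java), but the idea fits in a short AST check. A minimal sketch, assuming a Python codebase; the package names in `FORBIDDEN` are illustrative:

```python
import ast

# Architectural boundary: each key is a package, mapped to packages
# it must never import. "core"/"web"/"cli" are made-up examples.
FORBIDDEN = {
    "core": ["web", "cli"],  # core must not depend on delivery layers
}

def boundary_violations(source: str, package: str) -> list[str]:
    """Return the forbidden modules imported by `source`, treating it
    as a file that lives inside `package`."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            for banned in FORBIDDEN.get(package, []):
                if name == banned or name.startswith(banned + "."):
                    violations.append(name)
    return violations
```

Wired into CI, a check like this turns "core shouldn't know about the web layer" from a review comment into a failed build.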
Step 5: Set Up Agent-Driven Review Loops
The Harness team's merge workflow: the agent generates a PR, then runs self-review. It requests additional agent reviews (both local and cloud-based). It responds to feedback, iterates, and only flags a human when the automated reviewers can't reach consensus or when the change touches a sensitive area.
You can approximate this without Codex's specific tooling. Set up your CI pipeline to run lint, type checks, test suites, and a second agent pass that reviews the diff against your docs/conventions.md. If all checks pass, the PR is merge-ready. A human can spot-check, but they're not the primary gate.
I haven't tested this fully with Claude Code's review capabilities, so I can't say how well it compares to Codex's built-in review loop. The concept is the same either way.
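The merge gate itself reduces to a simple aggregation: a PR is merge-ready only when every mechanical check passes. A sketch, with placeholder check names; wire the callables to your real CI steps:

```python
from typing import Callable

def merge_ready(checks: dict[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """Run each named check; return (all_passed, names_of_failures)."""
    failures = [name for name, check in checks.items() if not check()]
    return (not failures, failures)

# Example wiring (each lambda stands in for a real CI step):
ok, failed = merge_ready({
    "lint": lambda: True,
    "types": lambda: True,
    "tests": lambda: True,
    "agent_review": lambda: True,  # second agent pass over the diff
})
```

The failure list is what you hand back to the agent for another iteration, rather than to a human.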
The Decomposition Pattern
You don't prompt an agent with "build me a user authentication system." You break it down: first build the token validation utility, then the session store interface, then the middleware that wires them together. Each piece is a self-contained task the agent can complete and test independently.
The Harness team's approach was explicitly bottom-up. Build small blocks. Verify they work. Then compose them into larger features. Each new level of complexity relies on abstractions the agent has already successfully built. This is how they gradually increased agent autonomy without everything falling apart.
There's nothing novel about this decomposition style. It's how good engineers have always worked. The difference is that you're decomposing for an agent's benefit, not your own. The granularity needs to be finer, the interfaces more explicit, and the expected outputs more clearly defined.
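A decomposition like the auth example is effectively a dependency graph, and the bottom-up build order falls out of a topological sort. A sketch using the stdlib `graphlib`; the task names are illustrative:

```python
from graphlib import TopologicalSorter

# Illustrative decomposition of the auth example above: each task is
# self-contained and depends only on blocks built (and tested) first.
TASKS = {
    "token_validation": [],
    "session_store": [],
    "auth_middleware": ["token_validation", "session_store"],
    "login_endpoint": ["auth_middleware"],
}

def build_order(tasks: dict[str, list[str]]) -> list[str]:
    """Order tasks bottom-up so each agent run only composes
    abstractions that already exist."""
    return list(TopologicalSorter(tasks).static_order())
```

Each entry in the resulting order is one agent task with its own verification step, which is what keeps the composition from collapsing.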
Managing Entropy: The Friday Garbage Collection
Full agent autonomy introduces drift. Agents replicate existing patterns, including ones that were acceptable last month but no longer match your current standards. The Harness team initially tried manual cleanup, spending every Friday (20% of the week) fixing "AI slop." That didn't scale.
Their solution: automated refactoring runs. Think of it as garbage collection for your codebase. A scheduled agent task scans for patterns that violate current conventions and generates refactoring PRs. The team runs this weekly.
This works because you've already encoded your conventions in documentation and linter rules. The refactoring agent reads the same docs/conventions.md as the code-generating agents, and any divergence between the codebase and the conventions becomes a cleanup target.
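The scan itself can start as something very simple: patterns that violate current conventions, matched across the tree, emitted as a worklist for the refactoring agent. A sketch with two made-up example patterns:

```python
import re
from pathlib import Path

# Deprecated-pattern registry for the weekly cleanup run.
# Both entries are illustrative; populate from docs/conventions.md.
DEPRECATED_PATTERNS = {
    "legacy logger": re.compile(r"\bprint\("),
    "old http client": re.compile(r"\burllib\.request\b"),
}

def cleanup_targets(repo_root: str) -> dict[str, list[str]]:
    """Map each file to the deprecated patterns it still contains,
    giving the refactoring agent a concrete worklist."""
    targets: dict[str, list[str]] = {}
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        hits = [name for name, rx in DEPRECATED_PATTERNS.items()
                if rx.search(text)]
        if hits:
            targets[str(path)] = hits
    return targets
```

Regexes are crude next to a lint rule, but they're a cheap way to bootstrap the GC run before a pattern is worth promoting into the linter.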
One thing the Harness team admits they don't know yet: how architectural coherence holds up over years in a fully agent-generated system. They're five months in. The long-term question is open.
Troubleshooting
Agent ignores your AGENTS.md instructions: Check the file size. If your AGENTS.md exceeds a few hundred lines, the agent may be running out of context budget for the actual task. Move detailed instructions to docs/ and keep AGENTS.md as a pointer. Also verify the file is in the right directory; agents typically read the nearest AGENTS.md in the directory tree, not the root one if you're working in a subdirectory that has its own.
Agent introduces wrong architectural patterns: This usually means your docs/architecture.md doesn't cover the domain the agent is working in, or your linter doesn't enforce the boundary. Check both. If the pattern is spreading, run a targeted refactoring pass immediately rather than letting it compound.
PRs pass CI but the code is subtly wrong: Your test coverage probably has gaps in the domain the agent is modifying. Agents are good at making tests pass, including by writing tests that don't actually verify the right behavior. Review test quality periodically, not just test existence. This is one area where human judgment still has the most leverage.
Agent-generated code works but is hard to understand: The Harness team's position is that agent-generated code doesn't need to match human stylistic preferences, as long as it's correct, maintainable, and legible to future agent runs. If readability is a concern for human reviewers in specific areas (critical business logic, security-sensitive code), flag those directories in AGENTS.md with a "requires human review" annotation.
Applying This in 2026
AI model quality is improving quarterly. The principles in this guide (repo legibility, mechanical enforcement, decomposition, automated review) are designed to compound with those improvements. Better models mean your existing AGENTS.md and docs/ structure produce even better code without you changing anything.
The main thing to watch: as models get better at handling larger context windows and longer tasks, you can increase the granularity of what you delegate. Tasks that required decomposing into five sub-tasks today might be a single prompt six months from now. But the infrastructure (documentation, CI enforcement, review loops) doesn't change. It just gets more leverage.
You now have a repo structured for agent-first development. The next step is running your first end-to-end feature through the pipeline: write the spec in docs/, create the task decomposition, let the agent generate and review, and merge through your automated gates. Start small. One feature. See what breaks. Fix the environment, not the prompt.