How to Use AI Coding Assistants for Large Codebase Projects

A practical workflow for multi-week AI-assisted development, tested on a Java-to-TypeScript library conversion.

Trần Quang Hùng, Chief Explainer of Things
January 13, 2026 · 10 min read

QUICK INFO

Difficulty Intermediate
Time Required 60-90 minutes to set up; weeks for full project execution
Prerequisites Working knowledge of Git, basic command line, one programming language
Tools Needed ChatGPT+ with Codex access, Claude Pro or Claude+, Gemini Pro subscription (optional)

What You'll Learn:

  • Set up agents.md and plans.md files that keep AI assistants aligned across sessions
  • Choose the right AI model for different task types
  • Build test infrastructure that catches AI-generated regressions
  • Manage context limits without losing project coherence

This guide covers how to coordinate multiple AI coding assistants on projects too large for a single session. The approach comes from a real library conversion that would have cost roughly $30,000 in developer time but ran about $30 in AI subscriptions over two weeks.

If you've bounced off AI coding because it falls apart on anything beyond a single file, this is for you. If you're already using Claude Code daily and want to add other models to your rotation, skip to the model selection section.

Getting Started

You need active subscriptions to at least two AI coding platforms. The combination that worked for this project: ChatGPT+ (for Codex), Claude Pro (for Claude Code and Opus 4.5), and Gemini Pro (for frontend work and large context tasks). You can start with just one, but the workflow really benefits from switching between models for different task types.

Create a project directory with this structure:

project/
├── agents.md
├── plans.md
├── src/
├── tests/
└── docs/
    └── reference/   # Original code, specs, examples

The agents.md and plans.md files are the backbone of everything. I'll cover exactly what goes in them, but the short version: agents.md describes your project and its conventions; plans.md tracks what's done and what's next. Every AI session starts by reading these files.

Setting Up the Documentation Layer

agents.md

This file tells every AI assistant what it's working on. Start broad, then add specifics as you hit friction. Mine started at 200 lines and ended at maybe 80, after I realized shorter was better and moved detailed specs elsewhere.

# Project: [Name]

## Overview
[2-3 sentences on what this project is and what you're building]

## Architecture
[Where code lives, how it's organized, key patterns]

## Conventions
- Language: TypeScript (strict mode)
- Testing: Jest with coverage >80%
- Naming: camelCase for functions, PascalCase for classes
[Add conventions as you notice AI making wrong assumptions]

## File Locations
- Source: /src
- Tests: /tests
- Reference implementation: /docs/reference/[original code]
- Detailed specs: /docs/specs/

## Current Focus
See plans.md for active tasks.

The "Conventions" section grows organically. Every time an AI writes code that doesn't match your style, add a line. Every time it makes a wrong assumption about file structure, add a line.

plans.md

This is your task tracker. OpenAI has a good writeup on this format at their cookbook (search for "codex exec plans"). The structure:

# Implementation Plan

## Completed
- [x] Task 1: Set up project structure
- [x] Task 2: Port core data types

## In Progress
- [ ] Task 3: Port validation logic
  - [x] Subtask 3.1: Basic type validation
  - [ ] Subtask 3.2: Cross-field validation
  
## Upcoming
- [ ] Task 4: Port API layer
- [ ] Task 5: Integration tests

After every AI session, update this file. The AI can help, but verify the updates yourself. I learned this after an enthusiastic model marked three tasks complete that were actually half-done.

Choosing Models for Different Tasks

Not all AI coding assistants are the same. This isn't comprehensive because I haven't tested everything extensively, but here's what I observed.

Codex 5.2 through ChatGPT+ handled the overall planning and long sequential work best. When I had a 30-step migration plan and needed someone to follow it methodically, Codex was the one. It seems to handle very large context without degrading the way other models do. The catch: you need extremely clear specs. Ambiguity kills it.

Opus 4.5 (via Claude Code or just in the Claude interface) wrote the cleanest code for well-scoped tasks. If I could describe what I needed in under 500 words, Opus usually got it right the first time. Context management matters here. I tried to keep conversations under 30% of the context window. After that, quality dropped noticeably, though I haven't measured this precisely.

Gemini 3.0 Pro surprised me on frontend work and on tasks requiring comparison of large files. It has a massive context window and uses it well. When I needed to compare a 3,000-line Java file against its TypeScript port to find discrepancies, Gemini was faster and more accurate than the others. Gemini Flash is good for quick checks where you can't afford to break anything, but it makes more mistakes than the full models.

I should clarify: "best for X" doesn't mean the others can't do X. It means when I had limited tokens and needed reliability, that's what I reached for.

Building Test Infrastructure

Tests are what make AI coding work at scale. Without them, you're debugging AI-generated code by reading it, which defeats the purpose.

Step 1: Get Reference Data

If you're porting existing code, run the original and capture inputs and outputs. For new projects, write expected behavior specs before coding.

# Example: capturing test cases from Java library
java -jar original.jar --test-mode > test-cases.json

This gives you ground truth to test against. Store it in /docs/reference/.
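Your TypeScript tests can then load the captured cases directly. A minimal sketch — the file shape (an array of named cases) is my assumption about what test-cases.json contains, not a fixed format:

```typescript
// load-cases.ts — load captured reference data for use in tests.
// Assumes test-cases.json is an array of { name, input, expected } objects.
import { readFileSync } from "fs";

export interface TestCase {
  name: string;
  input: unknown;
  expected: unknown;
}

export function loadCases(path: string): Map<string, TestCase> {
  const cases: TestCase[] = JSON.parse(readFileSync(path, "utf8"));
  // Index by name so individual tests can look up their case directly.
  return new Map(cases.map(c => [c.name, c]));
}
```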

Step 2: Generate Test Scaffolding

Ask the AI to generate tests based on your reference data, but be specific:

Read test-cases.json. For each case, generate a Jest test that:
1. Uses the input from the test case
2. Calls the corresponding function in our TypeScript implementation
3. Compares output to expected result
4. Reports which field differs if assertion fails

Review the generated tests. AI tends to write overly verbose test descriptions and sometimes misunderstands the assertions. Trim what you don't need.
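Point 4 in that prompt — reporting which field differs — is the part worth spot-checking hardest, since it's where generated assertions most often go wrong. The core logic is only a few lines; the function name and shapes here are illustrative, not from any particular project:

```typescript
// firstDiffField — return the first top-level field where two results
// disagree, or null if they match. Names and shapes are illustrative.
export function firstDiffField(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>
): string | null {
  // Union of keys catches fields missing from either side.
  const keys = new Set([...Object.keys(expected), ...Object.keys(actual)]);
  for (const key of keys) {
    // Serialize so nested objects compare by value, not reference.
    if (JSON.stringify(expected[key]) !== JSON.stringify(actual[key])) {
      return key;
    }
  }
  return null;
}
```

A failing assertion that says `field "createdAt" differs` is far faster to act on than a wall of diffed JSON.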

Step 3: Add Linting and Type Checking

Run these on every change:

// package.json (scripts section)
{
  "scripts": {
    "lint": "eslint src --ext .ts",
    "typecheck": "tsc --noEmit",
    "test": "jest",
    "verify": "npm run lint && npm run typecheck && npm run test"
  }
}

Tell the AI to run npm run verify after every change. Put this in agents.md. It will still sometimes forget.

Step 4: Add Pre-Commit Hooks

This catches what verbal instructions don't:

npx husky install
npx husky add .husky/pre-commit "npm run verify"

Now broken code can't be committed regardless of what the AI does.

Managing Context Across Sessions

This is where most AI coding projects fail. You hit context limits, start a new conversation, and the AI has forgotten everything.

The documentation layer (agents.md and plans.md) solves part of this. But there's more.

Session Handoff Pattern

At the end of each session, ask the AI to summarize:

Summarize what we accomplished this session and what's left for the next session.
Format as updates to plans.md.
List any new conventions or gotchas for agents.md.

Copy this into your files. When you start the next session (same model or different), begin with:

Read agents.md and plans.md. Confirm you understand the project state and current task.

Wait for confirmation before continuing. This feels slow but saves debugging time.

Keeping Files Updated

I made mistakes here initially. I'd let agents.md grow to 500+ lines with detailed specs, and the AIs would lose track of what mattered. Now I keep agents.md under 100 lines and split details into /docs/specs/ files that I reference when relevant.

Update plans.md after every completed task. No exceptions. It's annoying, but without it you'll spend the first 15 minutes of every session reconstructing state.

Working with Existing Codebases

When porting or modifying existing code:

Put reference files in /docs/reference/. Tell the AI to check them before writing new code. In agents.md, include:

## Reference Implementation
Original Java code is in /docs/reference/java/
Before implementing any function, read the corresponding Java implementation.
Match behavior exactly unless plans.md specifies a change.

For the library conversion project, I had the AI compare its output against the Java version after every function. This caught drift early.
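That drift check can be run as a batch pass over the captured cases. A sketch under my own assumptions — the case shape and the `port` callback are illustrative, not the project's actual API:

```typescript
// findDrift — names of captured cases whose ported output no longer
// matches the recorded Java output. Case shape is an assumption.
interface RefCase {
  name: string;
  input: unknown;
  expected: unknown;
}

export function findDrift(
  cases: RefCase[],
  port: (input: unknown) => unknown
): string[] {
  return cases
    // Serialize both sides so structured outputs compare by value.
    .filter(c => JSON.stringify(port(c.input)) !== JSON.stringify(c.expected))
    .map(c => c.name);
}
```

Running this after each ported function gives you a named list of regressions instead of a vague sense that "something broke."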

Troubleshooting

AI marks tasks complete that aren't

Verify completions yourself. Add verification steps to plans.md: "Task complete when: all tests pass, types check, matches Java output for test cases 1-15."

Context runs out mid-task

Break tasks smaller. If a task can't be completed in one session, it's too big. Each plans.md task should take 15-30 minutes of AI time, not hours.

Different models make incompatible changes

Stick to one model per file or module when possible. If you must switch, start the new session with "Read the current state of [file]. Describe what you see before making changes."

AI writes code that doesn't match project style

Add specific examples to agents.md. Not "use consistent naming" but "function names: getUser, validateInput, not get_user, validateUserInput_v2."
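Where possible, encode the convention in the linter too, so the AI gets mechanical feedback instead of relying on agents.md alone. A sketch of an ESLint config fragment, assuming the @typescript-eslint plugin is installed:

```javascript
// .eslintrc.js (fragment) — enforce the naming rules from agents.md
// mechanically, so `npm run verify` catches violations.
module.exports = {
  rules: {
    "@typescript-eslint/naming-convention": [
      "error",
      { selector: "function", format: ["camelCase"] },
      { selector: "class", format: ["PascalCase"] },
    ],
  },
};
```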

Linting/tests break after AI edits

This happens. The workflow expects it. That's why verify runs after every change and pre-commit hooks block broken code.

What's Next

This workflow handles projects up to a few thousand files and multi-week timelines. Larger than that, you probably need proper project management tooling, not just markdown files.

For immediate next steps: Claude Code documentation covers the specific commands if you're using that tool.


PRO TIPS

In Claude Code, /clear resets context without losing your session. Use it when responses start getting weird.

If you're on a Mac, iTerm2's broadcast input lets you run the same command in multiple terminals. Useful for running tests while the AI works in another pane.

Gemini's file upload handles larger files than you'd expect. I uploaded a 15MB JSON test fixture directly rather than copying pieces, and it referenced it correctly throughout the conversation.

When switching between models mid-task, explicitly state what was decided in previous sessions. "The previous session decided to use a factory pattern for X. Continue with that approach." AIs don't always infer this from code.


FAQ

Q: Can I use free tiers instead of paid subscriptions? A: For small tasks, yes. For multi-week projects, rate limits make free tiers impractical. The workflow assumes you can run multiple queries in sequence without waiting.

Q: How do I know when to switch models? A: When the current model is struggling. If Opus is giving inconsistent outputs on frontend work, try Gemini. If Codex is losing track of a complex plan, try breaking it smaller rather than switching. There's no formula, just pattern matching over time.

Q: What if my project doesn't have tests? A: Write them first, or have the AI write them. Seriously. This workflow relies on automated verification. Without tests, you're manually reviewing every AI change, which scales poorly.

Q: How long before the AI needs reminders about project conventions? A: In my experience, around 20-30 exchanges in Claude Code before context starts competing with project details. Shorter conversations are more reliable. Start fresh sessions for distinct tasks.

Q: What's the cost breakdown for a project like the library conversion? A: ChatGPT+ at $20/month, Claude Pro at €20/month, Gemini Pro at €20/month. Two weeks of active work meant half a month of each, so roughly €30 total. The equivalent consulting work was quoted at $30,000+.



Tags: AI coding, Claude Code, Codex, Gemini, large codebase, code migration, AI workflow, developer tools
Trần Quang Hùng, Chief Explainer of Things

Hùng is the guy his friends text when their Wi-Fi breaks, their code won't compile, or their furniture instructions make no sense. Now he's channeling that energy into guides that help thousands of readers solve problems without the panic.

