QUICK INFO
| Difficulty | Intermediate to Advanced |
| Time Required | 45-60 minutes to read; 2-4 weeks to implement |
| Prerequisites | Working knowledge of LLM APIs, Python or JavaScript, basic containerization concepts |
| Tools Needed | LLM API access (OpenAI, Anthropic, or Google), Docker, Kubernetes (optional), GitHub account |
What You'll Learn:
- How to decompose complex tasks into single-responsibility agents
- When to use direct function calls versus MCP tool integration
- How to implement a multi-model consortium for bias reduction
- How to deploy containerized agentic workflows to production
This guide covers the architecture and implementation of production-quality multi-agent AI systems based on research from Old Dominion University, Deloitte, and Nanyang Technological University. The focus is on practical engineering patterns rather than theoretical concepts. You should already understand how to call LLM APIs and build basic agent loops.
Getting Started
Production agentic AI differs from prototype scripts in three ways: determinism (same input produces same output), observability (you can trace why an agent made a decision), and maintainability (you can update one component without breaking others).
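To make the observability point concrete, here is a minimal sketch (illustrative, not part of the reference implementation) of a decorator that logs every agent call with a request ID so decisions can be traced after the fact:
import functools
import json
import logging
import time
import uuid

logger = logging.getLogger("workflow")

def traced(agent_name: str):
    """Wrap an agent call so its input, output, and timing are logged."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, request_id: str = "", **kwargs):
            request_id = request_id or str(uuid.uuid4())
            start = time.monotonic()
            result = fn(*args, **kwargs)
            logger.info(json.dumps({
                "request_id": request_id,
                "agent": agent_name,
                "duration_s": round(time.monotonic() - start, 3),
                "input_preview": str(args)[:200],
                "output_preview": str(result)[:200],
            }))
            return result
        return wrapper
    return decorator
Applying traced("topic-filter") to an agent's run method yields one structured log line per call without changing the agent itself.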
The reference implementation uses a podcast-generation workflow that:
- Scrapes news from RSS feeds
- Filters content by topic relevance
- Generates scripts using multiple LLMs
- Consolidates output through a reasoning agent
- Produces audio and video files
- Publishes results to GitHub automatically
This architecture demonstrates all nine best practices in a single production system.
Architecture Overview
The workflow chains specialized agents in sequence:
User Input → Web Search Agent → Topic Filter Agent → Web Scrape Agent
→ Podcast Script Agents (multiple LLMs) → Reasoning Agent
→ Audio/Video Script Agent → TTS/Video Generator → GitHub PR
Each agent handles exactly one task. The orchestration layer manages handoffs between agents using deterministic logic rather than LLM-based routing.
Best Practice 1: Prefer Tool Calls Over MCP
Model Context Protocol (MCP) provides standardized communication between agents and external services. However, MCP introduces abstraction layers that reduce determinism.
When MCP Fails
The research team initially used the GitHub MCP server to create pull requests. They observed:
- Ambiguous tool-selection decisions
- Inconsistent parameter inference
- Non-deterministic MCP responses that varied between runs
Despite repeated refinement of the agent instructions, the behavior remained unstable, with intermittent failures that appeared and disappeared between runs.
The Fix
Replace MCP integration with direct function calls that agents invoke explicitly.
Before (MCP-based):
# Agent must interpret MCP tool definitions and reason through metadata
agent.configure_mcp_server("github-mcp-server")
agent.invoke_mcp_tool("create_pull_request", params)
After (Direct tool call):
# Explicit function with clear parameters
def create_github_pr(repo: str, branch: str, title: str, body: str) -> dict:
    # Direct GitHub API call
    return github_client.create_pr(repo, branch, title, body)

# Agent calls function directly
result = create_github_pr(
    repo="org/podcast-output",
    branch="episode-2024-12-11",
    title="New Episode: AI News Roundup",
    body=script_content
)
Expected result: The PR creation step becomes deterministic. Failures produce clear error messages instead of ambiguous MCP responses.
When to Use MCP
MCP remains appropriate when:
- You need standardized access for multiple MCP-enabled clients (Claude Desktop, VS Code, LM Studio)
- The external service has no direct API
- You're building a platform where users bring their own integrations
For internal workflow operations where you control both ends, direct calls are more reliable.
Best Practice 2: Use Direct Function Calls Over Agent Tool Calls
Even with direct tools (not MCP), tool calls require the LLM to parse instructions, interpret parameters, and map natural language to function arguments. For operations that do not require language reasoning, this overhead is unnecessary.
Operations That Don't Need LLM Reasoning
- Posting data to an API
- Committing files to GitHub
- Writing to databases
- Generating timestamps
- File system operations
- HTTP requests
Implementation Pattern
Before (Agent with tool):
class PRAgent:
    tools = [create_github_pr_tool]

    def run(self, script_content):
        # LLM must reason about tool parameters
        # Token overhead + potential for misinterpretation
        return self.call_model_with_tools(
            f"Create a PR with this content: {script_content}"
        )
After (Pure function in orchestration layer):
class WorkflowController:
    def publish_results(self, script_content):
        # Direct function call - deterministic, testable, cheap
        return create_github_pr(
            repo=self.config.output_repo,
            branch=f"episode-{date.today()}",
            title=self.generate_title(script_content),
            body=script_content
        )
Expected result: The workflow controller handles infrastructure operations directly. LLM agents focus on tasks that require language understanding.
Decision Framework
Ask: "Does this step require the LLM to understand, reason, or generate language?"
- Yes: Use an agent with tool access
- No: Use a pure function in the orchestration layer
Best Practice 3: One Agent, One Tool
Attaching multiple tools to a single agent increases prompt complexity. The model must first decide which tool to invoke, then structure parameters correctly. This creates two failure modes instead of one.
Observed Failure Patterns
The research team designed an agent with two tools: scrape_markdown and publish_markdown. The intent was to scrape webpage content and publish the extracted markdown for audit purposes.
During evaluation:
- The agent sometimes invoked only one tool
- Sometimes invoked them in the wrong order
- Sometimes failed to call either tool, especially with larger inputs
Decomposition Pattern
Before (Multi-tool agent):
class ContentAgent:
    tools = [scrape_markdown, publish_markdown]
    prompt = """
    You have access to two tools:
    1. scrape_markdown - Extract content from URL
    2. publish_markdown - Save content to storage
    For each URL, scrape the content and then publish it.
    """
After (Single-tool agents):
class ScraperAgent:
    tools = [scrape_markdown]
    prompt = "Extract markdown content from the provided URL."

class PublisherAgent:
    tools = [publish_markdown]
    prompt = "Publish the provided markdown to storage."

# Orchestration handles sequencing
def process_url(url):
    content = scraper_agent.run(url)
    publisher_agent.run(content)
Expected result: Each agent has exactly one decision to make (how to parameterize its single tool). Tool-selection ambiguity disappears.
Best Practice 4: Single-Responsibility Agents
Separate from tool count, each agent should handle one conceptual responsibility. When an agent must generate, validate, transform, and execute in the same step, prompting becomes complex and failures become opaque.
Case Study: Veo-3 Video Generation
An early design combined video prompt generation and video creation in one agent. The agent received a script and was instructed to:
- Transform the script into Veo-3 JSON specification
- Generate the corresponding video
This blurred planning (designing the video prompt) and execution (calling the Veo API).
Observed failures:
- Malformed JSON output
- Mixed natural language with JSON
- Hallucinated file paths and status messages about generation that hadn't occurred
Decomposition
Veo JSON Builder Agent:
class VeoJSONBuilderAgent:
    """
    Single responsibility: Transform script into valid Veo-3 JSON.
    Output contract: Always returns valid JSON, nothing else.
    """
    prompt = """
    Convert the provided script into a Veo-3 JSON specification.
    Output ONLY valid JSON with this structure:
    {
      "scenes": [...],
      "timing": {...},
      "style": {...}
    }
    Do not include explanations or status messages.
    """
Video Generation Function (not an agent):
def generate_video(veo_json: dict) -> str:
    """
    Deterministic function that calls Veo API.
    Handles retries, error checking, file storage.
    Returns path to generated MP4.
    """
    response = veo_client.generate(veo_json)
    video_path = save_video(response.video_data)
    return video_path
Expected result: The agent's output is always parseable JSON. Side effects (API calls, file storage) happen in testable, deterministic code.
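To make that boundary explicit, the glue between the two can validate the agent's output before any side effect runs. A minimal sketch (the strict-JSON check here is an assumption, not the paper's exact code):
import json

async def build_and_generate(script: str) -> str:
    """Validate the builder agent's JSON before any side effects run."""
    raw_output = await veo_builder_agent.run(script)
    try:
        veo_json = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        # Fail fast with a clear error instead of passing bad input downstream
        raise ValueError(f"Veo builder returned non-JSON output: {exc}") from exc
    return generate_video(veo_json)  # the deterministic function above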
Best Practice 5: Store Prompts Externally
Embedding prompts in source code creates tight coupling between agent behavior and application deployment. Changes to prompts require code deployments. Non-technical stakeholders cannot iterate on agent instructions.
External Prompt Management
Store prompts in a dedicated repository, configuration service, or shared drive. Load them at runtime.
Project structure:
prompts/
├── agents/
│   ├── web-search-agent.md
│   ├── topic-filter-agent.md
│   ├── podcast-script-agent.md
│   ├── reasoning-agent.md
│   └── veo-builder-agent.md
└── config/
    └── prompt-versions.yaml
Runtime loading:
class PromptManager:
    def __init__(self, repo_url: str):
        self.repo = GitHubClient(repo_url)

    def get_prompt(self, agent_name: str, version: str = "latest") -> str:
        return self.repo.get_file(f"agents/{agent_name}.md", ref=version)

# Agent initialization
prompt_manager = PromptManager("org/workflow-prompts")
podcast_agent = Agent(
    prompt=prompt_manager.get_prompt("podcast-script-agent")
)
Expected result: Domain experts can update prompts through pull requests. You can A/B test prompt variations without code changes. Rollback is a git revert.
Governance Workflows
External prompts enable:
- Review processes before prompt changes go live
- Version pinning for reproducibility
- Controlled access through repository permissions
- Audit trails of who changed what and when
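A sketch of what version pinning can look like in practice, reusing the PromptManager above (the prompt-versions.yaml fields and the pinned tags are hypothetical):
import yaml

# config/prompt-versions.yaml might pin a git tag or commit per agent, e.g.:
#   podcast-script-agent: "v2.3.1"
#   reasoning-agent: "v1.7.0"
with open("config/prompt-versions.yaml") as f:
    pinned = yaml.safe_load(f)

prompt_manager = PromptManager("org/workflow-prompts")
reasoning_agent = Agent(
    prompt=prompt_manager.get_prompt(
        "reasoning-agent",
        version=pinned.get("reasoning-agent", "latest"),
    )
)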
Best Practice 6: Multi-Model Consortium for Responsible AI
Single-model outputs suffer from hallucinations, reasoning inconsistencies, and biases specific to that model's training. A multi-model consortium generates diverse outputs that a reasoning agent then consolidates.
Consortium Architecture
                 ┌─────────────────┐
                 │   Input Data    │
                 └────────┬────────┘
                          │
       ┌──────────────────┼──────────────────┐
       ▼                  ▼                  ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Claude    │    │    GPT-4    │    │   Gemini    │
│    Agent    │    │    Agent    │    │    Agent    │
└──────┬──────┘    └──────┬──────┘    └──────┬──────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          ▼
                 ┌─────────────────┐
                 │  Reasoning LLM  │
                 │ (Consolidation) │
                 └────────┬────────┘
                          ▼
                 ┌─────────────────┐
                 │  Final Output   │
                 └─────────────────┘
Implementation
Parallel script generation:
async def generate_scripts(content: str) -> list[str]:
    tasks = [
        claude_agent.generate(content),
        gpt_agent.generate(content),
        gemini_agent.generate(content)
    ]
    return await asyncio.gather(*tasks)
Reasoning agent prompt:
You are a consolidation agent. You receive multiple drafts of the same
content from different AI models.
Your task:
1. Compare all drafts for factual consistency
2. Identify claims that appear in multiple drafts (high confidence)
3. Flag claims that appear in only one draft (verify or remove)
4. Resolve contradictions by favoring the most conservative claim
5. Remove speculation not grounded in the source material
6. Produce a single unified output
Do not add new information. Only synthesize what the drafts provide.
Consolidation:
def consolidate(drafts: list[str], source_content: str) -> str:
    return reasoning_agent.run(
        f"""
        Source material:
        {source_content}
        Draft 1 (Claude):
        {drafts[0]}
        Draft 2 (GPT):
        {drafts[1]}
        Draft 3 (Gemini):
        {drafts[2]}
        Produce a consolidated script following your instructions.
        """
    )
Expected result: The final output reflects cross-model agreement. Single-model hallucinations get filtered out during consolidation.
Benefits
- Higher accuracy through consensus
- Reduced bias by incorporating diverse model behaviors
- Robustness to model updates (if one model degrades, others compensate)
- Auditability (you can trace which models agreed on which claims)
Best Practice 7: Separate Workflow Engine from MCP Server
To expose workflows to MCP-enabled clients (Claude Desktop, VS Code, LM Studio), serve the workflow via REST API and use the MCP server as a thin adapter.
Architecture
┌─────────────────────────────────────────────────────────┐
│                       MCP Clients                        │
│          (Claude Desktop, VS Code, LM Studio)            │
└─────────────────────┬───────────────────────────────────┘
                      │ MCP Protocol
                      ▼
┌─────────────────────────────────────────────────────────┐
│                       MCP Server                         │
│       (Thin adapter - forwards calls to REST API)        │
└─────────────────────┬───────────────────────────────────┘
                      │ HTTP/REST
                      ▼
┌─────────────────────────────────────────────────────────┐
│                    Workflow REST API                     │
│   /api/v1/workflow/start                                 │
│   /api/v1/workflow/status                                │
│   /api/v1/workflow/results                               │
└─────────────────────┬───────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────┐
│                     Workflow Engine                      │
│         (Agent orchestration, tool integration)          │
└─────────────────────────────────────────────────────────┘
MCP Server Implementation
# mcp_server.py - Thin adapter only
from mcp import Server, Tool
import httpx

server = Server("podcast-workflow")
api_client = httpx.Client(base_url="http://workflow-api:8000")

@server.tool("generate_podcast")
def generate_podcast(topic: str, sources: list[str]) -> dict:
    """MCP tool that forwards to workflow API"""
    response = api_client.post(
        "/api/v1/workflow/start",
        json={"topic": topic, "sources": sources}
    )
    return response.json()

@server.tool("check_status")
def check_status(workflow_id: str) -> dict:
    response = api_client.get(f"/api/v1/workflow/status/{workflow_id}")
    return response.json()
Expected result: The MCP server stays simple and stable. The workflow engine can iterate rapidly without affecting the MCP interface. Components scale independently.
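For completeness, a minimal sketch of the REST side of this split. It assumes FastAPI (consistent with the uvicorn command in the next section); the in-memory job store and handler names are illustrative, not the reference implementation:
# workflow/api.py - sketch of the REST layer the MCP adapter calls
import asyncio
import uuid
from fastapi import FastAPI

app = FastAPI()
jobs: dict[str, dict] = {}  # in-memory store; a real deployment would use a task queue and persistent job store

async def _run_and_record(workflow_id: str, topic: str, sources: list[str]) -> None:
    result = await run_podcast_workflow(topic, sources)
    jobs[workflow_id] = {"status": "done", "result": result}

@app.post("/api/v1/workflow/start")
async def start_workflow(payload: dict) -> dict:
    workflow_id = str(uuid.uuid4())
    jobs[workflow_id] = {"status": "running", "result": None}
    asyncio.create_task(_run_and_record(workflow_id, payload["topic"], payload["sources"]))
    return {"workflow_id": workflow_id}

@app.get("/api/v1/workflow/status/{workflow_id}")
async def workflow_status(workflow_id: str) -> dict:
    job = jobs.get(workflow_id, {"status": "unknown"})
    return {"workflow_id": workflow_id, "status": job["status"]}

@app.get("/api/v1/workflow/results/{workflow_id}")
async def workflow_results(workflow_id: str) -> dict:
    return jobs.get(workflow_id, {"status": "unknown", "result": None})

@app.get("/health")
async def health() -> dict:
    # Target of the Kubernetes liveness probe in the next section
    return {"status": "ok"}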
Best Practice 8: Containerized Deployment
Package the workflow engine and MCP server in Docker containers. Orchestrate with Kubernetes for production deployments.
Dockerfile for Workflow Engine
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY workflow/ ./workflow/
COPY agents/ ./agents/
COPY config/ ./config/
EXPOSE 8000
CMD ["uvicorn", "workflow.api:app", "--host", "0.0.0.0", "--port", "8000"]
Kubernetes Deployment
# workflow-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: podcast-workflow
spec:
  replicas: 3
  selector:
    matchLabels:
      app: podcast-workflow
  template:
    metadata:
      labels:
        app: podcast-workflow
    spec:
      containers:
        - name: workflow
          image: org/podcast-workflow:v1.2.0
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: openai-key
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secrets
                  key: anthropic-key
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: podcast-workflow
spec:
  selector:
    app: podcast-workflow
  ports:
    - port: 80
      targetPort: 8000
Expected result: The workflow scales automatically based on load. Failed pods restart. Secrets are managed securely. Rolling updates deploy without downtime.
Operational Benefits
- Portability: Runs identically in dev, staging, and production
- Scalability: Kubernetes auto-scales based on CPU/memory or custom metrics
- Resilience: Health checks and automatic pod restarts
- Security: Network policies, RBAC, secret management
- Observability: Integration with Prometheus, Grafana, OpenTelemetry
Best Practice 9: Keep It Simple
Agentic workflows delegate reasoning to LLMs. They do not need complex internal architectures. Avoid:
- Deep inheritance hierarchies
- Microservice-like decomposition within the workflow
- Multiple indirection layers
- Design patterns that add abstraction without value
What Simple Looks Like
Simple workflow structure:
workflow/
├── main.py # Entry point
├── agents.py # Agent definitions (one per agent)
├── tools.py # Pure functions
├── orchestrator.py # Sequential agent coordination
└── config.py # Settings
Simple orchestration:
# orchestrator.py
async def run_podcast_workflow(topic: str, sources: list[str]) -> dict:
    # Step 1: Search
    articles = await web_search_agent.run(sources)

    # Step 2: Filter
    relevant = await topic_filter_agent.run(articles, topic)

    # Step 3: Scrape
    content = await scrape_content(relevant)  # pure function

    # Step 4: Generate scripts (parallel)
    drafts = await generate_scripts(content)

    # Step 5: Consolidate
    final_script = await reasoning_agent.run(drafts)

    # Step 6: Generate media
    audio_path = await generate_audio(final_script)  # pure function
    video_json = await veo_builder_agent.run(final_script)

    # Step 7: Publish
    pr_url = create_github_pr(final_script, audio_path, video_json)

    return {"script": final_script, "audio": audio_path, "pr": pr_url}
Expected result: Anyone can read the orchestrator and understand the workflow in under a minute. AI coding assistants can modify the code accurately because the structure is flat and obvious.
Why Simplicity Matters for AI-Assisted Development
Modern AI coding tools (Claude, GitHub Copilot) perform better on simple codebases:
- Fewer files to track in context
- Clear function boundaries
- Obvious data flow
- No hidden state in nested abstractions
A simple workflow gets better suggestions from AI tools and requires less manual debugging.
Troubleshooting
Symptom: Agent invokes wrong tool or no tool at all. Fix: Check if the agent has multiple tools attached. Split into single-tool agents. Verify the prompt clearly describes when to use the tool.
Symptom: MCP tool calls return inconsistent results across runs. Fix: Replace MCP integration with direct function calls for that operation. Reserve MCP for external client access only.
Symptom: Reasoning agent adds hallucinated content not in source drafts. Fix: Strengthen the prompt constraint: "Do not add information. Only synthesize from the provided drafts." Consider adding a verification step that checks output against source material (see the sketch below).
Symptom: Kubernetes pods crash during high load. Fix: Check memory limits. LLM API calls can buffer large responses. Increase memory limits or implement streaming responses. Add horizontal pod autoscaling based on request queue depth.
Symptom: Prompt changes don't take effect. Fix: Verify the prompt manager is fetching the correct version. Check for caching in the application. Restart pods if prompts are loaded at startup only.
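For the hallucination symptom above, a minimal sketch of the suggested verification step (the verification_agent and its prompt are illustrative, not from the reference implementation):
def verify_consolidated_output(final_script: str, source_content: str) -> str:
    """Second pass: ask a model to flag claims not grounded in the source."""
    return verification_agent.run(
        f"""
        Source material:
        {source_content}
        Consolidated script:
        {final_script}
        List any claims in the script that are not supported by the source
        material. If every claim is supported, reply with exactly: OK
        """
    )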
What's Next
You now have a blueprint for production-grade agentic AI workflows. The reference implementation (podcast-generation workflow) is available in the GitLab repository.
For MCP server integration, see the companion MCP server repository.
PRO TIPS
Use asyncio.gather() for parallel agent execution when agents don't depend on each other's output. The podcast workflow runs three LLM agents simultaneously, reducing total latency from 3x to 1x the single-agent time.
Pin prompt versions in production configs (e.g., prompt_version: "v2.3.1"). This prevents unexpected behavior changes when someone updates prompts in the repository.
Add request IDs to every workflow execution and propagate them through all agent calls. This makes distributed tracing possible when debugging failures across multiple agents.
Set explicit timeouts on all LLM API calls. Without timeouts, a slow response blocks the entire workflow. The research implementation uses 30-second timeouts with automatic retry.
Log the complete prompt sent to each agent (not just the template). When an agent misbehaves, you need to see exactly what input it received, including dynamic content.
COMMON MISTAKES
Using MCP for internal operations: MCP adds latency and ambiguity. Use it only for external client access. Internal operations should use direct function calls or REST APIs.
Letting agents chain themselves: Avoid patterns where Agent A decides to call Agent B. The orchestration layer should control sequencing. Agent-to-agent calls create hidden dependencies that are hard to debug and impossible to test in isolation.
Skipping the reasoning agent for "simple" tasks: Even straightforward tasks benefit from multi-model consensus. A single model can confidently produce wrong output. The reasoning step catches errors that no single model would flag.
Over-engineering the orchestration layer: The workflow controller should be a flat sequence of function calls. If you're building state machines, dependency injection frameworks, or plugin architectures, you're adding complexity that makes debugging harder and AI coding assistants less effective.
PROMPT TEMPLATES
Topic Filtering Agent
You are a topic relevance classifier.
Given a list of article URLs and titles, return only those relevant to the
specified topic. Be inclusive - if an article might be relevant, include it.
Topic: {{topic}}
Articles:
{{articles_json}}
Return a JSON array of relevant URLs only:
["url1", "url2", ...]
Customize by: Adjusting the topic specificity. For broad topics like "AI news," use loose matching. For specific topics like "transformer architecture improvements," require explicit mentions.
Example output:
["https://example.com/ai-regulation-update", "https://example.com/new-llm-benchmark"]
Reasoning/Consolidation Agent
You consolidate multiple AI-generated drafts into a single authoritative output.
Instructions:
1. Identify claims present in 2+ drafts (high confidence - include)
2. Flag claims in only 1 draft (low confidence - verify against source or remove)
3. Resolve contradictions by choosing the more conservative statement
4. Remove speculation not supported by source material
5. Maintain consistent tone and structure
Source material:
{{source_content}}
Draft 1 ({{model_1_name}}):
{{draft_1}}
Draft 2 ({{model_2_name}}):
{{draft_2}}
Draft 3 ({{model_3_name}}):
{{draft_3}}
Produce a consolidated version. Do not add new information.
Customize by: Adding domain-specific verification rules. For medical content, require citations. For news, require date verification.
Example output: A unified script that contains only claims supported by multiple models, with speculative statements removed.
FAQ
Q: How many LLMs should be in the consortium? A: Three is the practical minimum for meaningful consensus. Five provides better coverage but increases cost and latency. The research used three (Claude, GPT, Gemini) for a balance of diversity and efficiency.
Q: Can I use the same model family for all consortium agents? A: Using the same model (e.g., three GPT instances) provides no diversity benefit. The value comes from different training data and reasoning approaches across model families.
Q: How do I handle rate limits across multiple LLM providers? A: Implement exponential backoff with jitter at the agent level. The orchestrator should catch rate limit errors and retry with delays. For sustained high volume, use separate API keys per agent or queue requests.
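A minimal sketch of that retry policy (the helper name and limits are illustrative assumptions):
import asyncio
import random

async def call_with_backoff(agent, payload, max_retries: int = 5):
    """Retry an agent call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return await agent.run(payload)
        except Exception:
            # In practice, catch the provider's rate-limit exception here
            # (e.g. openai.RateLimitError or anthropic.RateLimitError)
            delay = (2 ** attempt) + random.random()  # 1s, 2s, 4s, ... plus jitter
            await asyncio.sleep(delay)
    raise RuntimeError("Rate limit persisted after retries")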
Q: What if the reasoning agent disagrees with all input drafts? A: This indicates a prompt or source material problem. The reasoning agent should not generate novel content. If it's rejecting all drafts, check that the source material actually supports the topic and that draft agent prompts are correctly constrained.
Q: Is Kubernetes required for production deployment? A: No, but it simplifies scaling and operations. You can run containerized workflows on any Docker host. Kubernetes becomes valuable when you need auto-scaling, rolling updates, and multi-replica deployments.
Q: How do I version prompts alongside code? A: Store prompts in a separate repository with semantic versioning. Pin prompt versions in your workflow config file. When deploying a new workflow version, specify which prompt versions it requires. This decouples prompt iteration from code deployments.
RESOURCES
Original Research Paper: A Practical Guide for Designing, Developing, and Deploying Production-Grade Agentic AI Workflows: The academic paper this guide is based on, with detailed evaluation results and additional architectural diagrams.
Podcast Workflow Implementation: Complete source code for the reference workflow including agent definitions, tool functions, and Kubernetes manifests.
MCP Server Implementation: The MCP adapter that exposes the workflow to MCP-enabled clients.
OpenAI Agents SDK: The SDK used in the reference implementation for agent orchestration.
Model Context Protocol Specification: Official MCP documentation for building MCP servers and understanding the protocol.