AGENT TEAMS — Building a Multi-Agent Pipeline on Top of AI Coding Tools

AI coding assistants have changed how we build software. Tools like Claude Code, Cursor, Windsurf, and OpenCode are genuinely impressive — they can write code, debug issues, refactor files, and handle complex tasks that would have been unthinkable a couple of years ago. I use them daily and they've made me significantly more productive.

But the more I pushed these tools on real, multi-disciplinary projects — the kind with dozens of deliverables, multiple skill domains, and weeks of execution — the more I ran into a recurring set of gaps. Not flaws in the tools themselves, but limitations of the single-agent, single-session paradigm.

AGENT TEAMS is my attempt to take the next step. It's a multi-agent orchestration framework that sits on top of an AI coding tool's infrastructure and adds the layers that turn a brilliant individual contributor into a managed team: structured planning, adversarial quality gates, crash-safe state persistence, cost-optimized model routing, and human-in-the-loop controls.

This post walks through the architecture, the pipeline, the state management system, and the thinking behind the design decisions.


Where Single-Agent Tools Leave Room to Grow

Context pressure on long projects. Even with large context windows, a single agent managing a 50-task project has to juggle requirements, execution state, file context, and decision history all at once. The more you load into one session, the harder it is for the model to maintain focus on the immediate task.

No structured planning phase. Most tools are optimized for the "give me a task, I'll do it now" workflow — which is great for daily work. But for a project with interdependent tasks, different skill domains, and a multi-phase timeline, there's value in having a planning stage that happens before execution begins.

Session-bound state. If your session crashes, gets disconnected, or times out during a long-running task, you typically lose the execution context.

Manual effort to optimize model usage. Most AI coding tools can run multiple models, but getting the right model on the right task requires manual work — switching models mid-session, crafting agent definitions, and tuning configurations per task type. It's possible, but it doesn't happen automatically. Across dozens of tasks spanning different domains, the overhead of doing this by hand means most people just run the default model for everything, leaving cost savings and quality gains on the table.

These aren't problems to solve within the existing tools — they're problems to solve on top of them. That's the layer AGENT TEAMS occupies.


The Core Idea: An Orchestration Layer, Not a Replacement

AGENT TEAMS wraps current agent tools. Under the hood, it uses an existing AI coding tool's SDK to programmatically start a server, create isolated sessions, send prompts, monitor events, and manage permissions. The framework adds three things that the underlying tool doesn't handle on its own:

  1. A structured, multi-stage pipeline with planning, auditing, execution, and learning phases
  2. A file-based state management system that survives crashes and enables resume-from-anywhere
  3. A multi-agent dispatch system that routes tasks to specialized agents running cost-appropriate models

Think of it like this: the AI coding tool is the engine. AGENT TEAMS is the assembly line that coordinates which engine does what, in what order, with what quality checks, and what happens when something breaks.


High-Level Architecture

The system is organized into three layers:

The Pipeline Runner

A Node.js script that acts as the sole authority for lifecycle control. It starts a headless server from the underlying AI tool, creates isolated sessions for each agent, sends prompts, monitors real-time events via SSE (Server-Sent Events), handles permission requests, persists state after every step, and makes decisions about what to do when an agent stops.

This is deterministic code, not an LLM. It doesn't hallucinate, doesn't lose context, and doesn't make probabilistic decisions about task ordering. It's a state machine with well-defined transitions.
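A minimal sketch of what that looks like as code. The stage names mirror the pipeline described below, but the transition table itself is illustrative, not the framework's actual implementation:

```javascript
// Illustrative stage state machine for the pipeline runner.
// Stage names follow the 6-stage pipeline; transitions are hypothetical.
const TRANSITIONS = {
  requirements: ["planning"],
  planning: ["audit"],
  audit: ["refine", "execution"], // refinement is skipped if the audit is clean
  refine: ["audit"],              // the audit-refine loop
  execution: ["supervision"],
  supervision: ["learning"],
  learning: [],                   // terminal stage
};

function nextStage(current, target) {
  const allowed = TRANSITIONS[current] ?? [];
  if (!allowed.includes(target)) {
    throw new Error(`Illegal transition: ${current} -> ${target}`);
  }
  return target;
}
```

Because the transitions are a fixed table rather than model output, an invalid move is a hard error, never a plausible-sounding improvisation.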

Pipeline Agents

A set of specialized agents that manage the planning and quality infrastructure. These include agents for requirements gathering, project planning, adversarial plan auditing, plan refinement, execution support, supervision, and meta-learning. They run infrequently and their output quality cascades through the entire project — so they use higher-capability models.

Execution Agents

A roster of domain-specific agents that do the actual project work. The roster includes roles like developer, designer, content writer, SEO specialist, data analyst, DevOps engineer, QA engineer, social media manager, and more. Each agent has its own system prompt, tool access list, and a recommended model chosen for cost-efficiency in its domain.

The planner agent reads the full agent roster and automatically assigns the best-fit agent to each task based on the task's domain, required tools, and complexity.
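To make that concrete, here is a hypothetical shape for a roster entry and a naive best-fit assignment. The field names, the example agents, and the matching rule are all assumptions for illustration, not the framework's actual schema:

```javascript
// Hypothetical roster entries: name, covered domains, tool access, model tier.
const roster = [
  { name: "developer",      domains: ["code", "api"], tools: ["bash", "edit"],  model: "budget-code" },
  { name: "content-writer", domains: ["copy", "docs"], tools: ["edit"],         model: "standard" },
  { name: "seo-specialist", domains: ["seo"],          tools: ["fetch", "edit"], model: "standard" },
];

function assignAgent(task) {
  // Best fit: the agent must cover the task's domain and hold
  // every tool the task requires.
  const fits = roster.filter(a =>
    a.domains.includes(task.domain) &&
    task.requiredTools.every(t => a.tools.includes(t))
  );
  return fits[0]?.name ?? null; // null => surface as a planning blocker
}
```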


The 6-Stage Pipeline

The pipeline transforms a high-level goal into executed deliverables through six stages. Each stage has defined inputs, outputs, and validation criteria.

Stage 1: Requirements Gathering

You provide your project goal and available resources as input files. A requirements agent reads everything and produces a structured requirements document with numbered IDs, acceptance criteria, priorities, and cross-references. This isn't a reformatted version of your input — the agent fills gaps, identifies ambiguities, and structures things so every downstream task can trace back to a specific requirement.

Stage 2: Planning

A planning agent reads the structured requirements and produces a detailed execution plan. Every task in the plan carries rich metadata: a unique ID, assigned agent, dependency list, priority, effort estimate, interaction mode (autonomous vs. human-required), failure handling strategy, acceptance criteria, required tools, expected inputs, and expected outputs.

The planner also produces a project context document (architectural decisions, risks, deliverables structure) and a toolbox manifest (which tools are actually available and verified in the current environment).

A typical plan might contain dozens of tasks across multiple phases, each with explicit dependency chains. The dependency graph is validated for cycles before execution begins.
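The cycle check can be done with a standard topological-sort pass. The post doesn't specify the algorithm; this sketch uses Kahn's algorithm and assumes every dependency references a known task ID (which the planner is responsible for guaranteeing):

```javascript
// Detect cycles in the task dependency graph via Kahn's algorithm.
// tasks: [{ id, deps: [ids] }]; assumes deps reference known task ids.
function hasCycle(tasks) {
  const dependents = new Map(tasks.map(t => [t.id, []]));
  for (const t of tasks) for (const d of t.deps) dependents.get(d).push(t.id);

  const indegree = new Map(tasks.map(t => [t.id, t.deps.length]));
  const queue = tasks.filter(t => t.deps.length === 0).map(t => t.id);

  let visited = 0;
  while (queue.length) {
    const id = queue.shift();
    visited++;
    for (const next of dependents.get(id)) {
      indegree.set(next, indegree.get(next) - 1);
      if (indegree.get(next) === 0) queue.push(next);
    }
  }
  return visited !== tasks.length; // any unvisited task sits on a cycle
}
```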

Stage 3: Audit & Refinement

This is the stage that doesn't exist in the single-agent paradigm, and it's the one that makes the biggest difference in output quality.

Before any project work begins, an audit agent — powered by a reasoning-optimized model — adversarially attacks the plan. Are there missing dependencies? Unrealistic effort estimates? Tasks assigned to agents that don't have the right tools? Requirements that aren't covered by any task? The audit classifies findings by severity: Critical, Major, Minor, and Informational.

If significant issues are found, a refinement agent rewrites the affected plan sections. This audit-refine loop can run multiple cycles. If the audit finds no significant issues, refinement is automatically skipped — no wasted tokens or time.
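The shape of that loop, sketched with the agent invocations passed in as functions (`runAudit` and `runRefinement` are stand-ins, not real APIs; only the severity labels come from the post):

```javascript
// Sketch of the audit-refine loop. Refinement runs only when the audit
// reports Critical or Major findings, and the loop is bounded.
async function auditRefineLoop(plan, runAudit, runRefinement, maxCycles = 3) {
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const findings = await runAudit(plan); // [{ severity, section, note }]
    const significant = findings.filter(
      f => f.severity === "Critical" || f.severity === "Major"
    );
    if (significant.length === 0) {
      return { plan, cycles: cycle }; // clean audit: skip refinement entirely
    }
    plan = await runRefinement(plan, significant);
  }
  return { plan, cycles: maxCycles, exhausted: true }; // cap reached: escalate
}
```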

Stage 4: Execution

In earlier versions of the architecture, execution was delegated to a single long-lived orchestrator agent. The problem? That orchestrator was itself an LLM — it could lose context, make poor dispatch decisions, or fail silently over a long session.

The current architecture makes the pipeline runner (deterministic code) the dispatcher. It:

  • Parses the plan to extract all tasks with their metadata
  • Builds and validates the dependency graph
  • Flags tasks that need human input as blockers
  • Enters a dispatch loop: selects the highest-priority ready task, creates a fresh session for the assigned agent, sends the task with full context, monitors progress, verifies outputs, updates the plan's status markers, and repeats

Each task runs in an isolated session with only the context it needs. No context bleeding between tasks, no accumulated confusion over a long-running session.
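The heart of the dispatch loop is task selection. A sketch of one plausible rule (the field names and numeric-priority convention are assumptions): pick the highest-priority pending task whose dependencies are all complete and which isn't pre-flagged as needing a human.

```javascript
// Select the next dispatchable task, or null if nothing is ready.
function selectNextTask(tasks) {
  const done = new Set(tasks.filter(t => t.status === "done").map(t => t.id));
  const ready = tasks.filter(t =>
    t.status === "pending" &&
    !t.humanRequired &&
    t.deps.every(d => done.has(d))
  );
  ready.sort((a, b) => b.priority - a.priority); // higher number = higher priority
  return ready[0] ?? null; // null => finished, or everything is blocked
}
```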

Stage 5: Supervision

For longer-running projects, a supervision agent monitors execution health. Are tasks completing on schedule? Are blockers accumulating? Has the project drifted from the plan? It can flag issues or trigger re-planning if things have gone significantly off-track.

Stage 6: Meta-Learning

After execution, evaluation and learning agents analyze what happened: which agents performed well, which tasks took longer than estimated, what failure patterns emerged. The learning agent proposes system-level improvements — better prompt templates, model reassignments, workflow optimizations. This creates a feedback loop that makes the system better over time.


The Intelligent Control Loop

When an agent finishes a task (or stops for any reason), the pipeline runner doesn't just check "success or failure." It classifies the stop reason into one of several categories and takes the appropriate action automatically:

  • Completed — Agent finished successfully. Move to the next task or stage.
  • Human required — The task needs human input. Pause the pipeline, log the blocker, and exit with a specific code so the operator knows what to do.
  • Blocked — A non-human blocker (missing tool, unavailable resource). Pause and log.
  • Scheduled — Tasks are scheduled for a future date. Pause and save the resume-after time.
  • Partial progress — The agent made progress but didn't finish (context exhaustion, rate limit, etc.). Re-invoke the same agent with a continuation prompt so it picks up where it left off without duplicating work.
  • Error / Timeout — Retry up to a configurable limit, then fail the task.
  • Phase gate halt — A quality gate returned "halt." Pause for review.

The partial progress handling is especially important for long tasks. Instead of losing everything when an agent hits a context limit, the runner detects the progress percentage, re-invokes the agent with instructions to continue, and the task completes across multiple invocations.
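The classification above maps naturally onto a dispatch function. The reason strings and exit codes here are illustrative, but the categories mirror the list:

```javascript
// Map a classified stop reason to the runner's next action.
function decideNextAction(stop, attempt, maxRetries = 2) {
  switch (stop.reason) {
    case "completed":       return { action: "advance" };
    case "human_required":  return { action: "pause", exitCode: 10 };
    case "blocked":         return { action: "pause", exitCode: 11 };
    case "scheduled":       return { action: "pause", resumeAfter: stop.resumeAfter };
    case "partial":         return { action: "continue_same_agent", progress: stop.progress };
    case "phase_gate_halt": return { action: "pause", exitCode: 12 };
    case "error":
    case "timeout":
      return attempt < maxRetries
        ? { action: "retry", attempt: attempt + 1 }
        : { action: "fail_task" };
    default:
      return { action: "fail_task" }; // unknown stop reasons fail safe
  }
}
```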


State Management: Files as the Database

The state system includes:

A master pipeline state file that records every step the pipeline has executed — run ID, status, configuration snapshot, step history with timing and cost data, and total accumulated cost. This file is persisted after every step using atomic writes.

A structured blocker tracker that records blockers with type classification (human-required, technical failure, missing input, missing tool, circular dependency, external dependency), resolution status, and timestamps. When the pipeline pauses, the operator checks this file to understand exactly what's needed.

An append-only audit log that records every permission request and tool invocation — agent name, permission category, tool, arguments (truncated), and the allow/deny decision. This is the accountability trail for unattended runs.

A pause/resume state file that captures why the pipeline paused, what action is needed, and how to resume. This works the same way regardless of the pause reason — human blocker, error, scheduled task, or phase gate halt.

A run history file that accumulates summaries across pipeline runs: costs, token counts, stage outcomes, failure patterns, retry counts. This enables the meta-learning agents to spot patterns and propose improvements.

Human-readable dashboard files — a system health file (current task, health status, agent availability), a progress file (task counts, phase progress, velocity metrics), an execution log (ordered task checklist), and a learning log (patterns and insights discovered during execution).


State Recovery: Designed for Interruption

The system is built around the assumption that things will go wrong. Sessions will crash. Terminals will disconnect. Rate limits will hit at the worst time. The recovery model handles all of this.

Atomic writes. Every state file update goes through a write-to-temp-then-rename pattern. On POSIX systems, the rename operation is atomic — the file is either the old version or the new version, never a corrupted in-between state.

Graceful shutdown. The pipeline runner intercepts interrupt signals (Ctrl+C, SIGTERM). Instead of crashing mid-write, it saves the current state as "interrupted" and exits cleanly. Re-running the pipeline detects the interrupted state and resumes from the exact step where it stopped.
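A sketch of that handler, with the persistence and exit calls injected so the behavior is testable (`saveState` stands in for the atomic-write persistence; the exit code and field names are illustrative):

```javascript
// Intercept SIGINT/SIGTERM, persist an "interrupted" marker, then exit.
function installShutdownHandlers(state, saveState, exit = code => process.exit(code)) {
  const handler = signal => {
    state.status = "interrupted";
    state.signal = signal;
    saveState(state); // one last atomic write before exiting
    exit(130);        // conventional exit code for interruption
  };
  process.on("SIGINT", () => handler("SIGINT"));
  process.on("SIGTERM", () => handler("SIGTERM"));
  return handler;
}
```

On the next run, the runner sees `status: "interrupted"` in the state file and resumes from the recorded step instead of starting over.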

Flexible resume controls. You can resume from where you left off (default), start completely fresh (which archives the previous run with a timestamp), or re-run from a specific stage onward (useful when you want to manually edit the plan and re-audit).

Archived snapshots. Previous runs are never deleted — they're moved to a timestamped archive directory. You can always go back and inspect what the system was doing in any previous run.


Cost-Optimized Model Routing

When you use a single AI tool, every task — from complex architectural reasoning to a simple file rename — runs through the same model at the same cost. AGENT TEAMS introduces a tiered model strategy:

Premium tier for pipeline agents that run infrequently but whose output quality affects the entire project. The planning and refinement agents need the best reasoning and instruction-following available.

Reasoning tier for agents that do analytical or adversarial work. The plan auditor needs a reasoning-optimized model to find gaps and inconsistencies.

Standard tier for most execution agents. Content writing, SEO analysis, UI work, and monitoring need good general capability at moderate cost.

Budget tier for code-heavy and high-frequency tasks. Coding-specialized models deliver excellent results at a fraction of the cost of general-purpose models. Coordination tasks that produce structured status reports can use the most affordable models available.
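Mechanically, the routing can be as simple as a two-level lookup. The tier names follow the post; the model IDs and per-agent mapping below are placeholders, not real configuration:

```javascript
// Tier-to-model and agent-to-tier maps; model IDs are placeholders.
const TIER_MODELS = {
  premium:   "premium-reasoning-model",
  reasoning: "reasoning-optimized-model",
  standard:  "general-purpose-model",
  budget:    "coding-specialized-model",
};

const AGENT_TIERS = {
  planner: "premium",
  refiner: "premium",
  auditor: "reasoning",
  "content-writer": "standard",
  developer: "budget",
};

function modelFor(agent) {
  // Unknown agents fall back to the standard tier.
  return TIER_MODELS[AGENT_TIERS[agent] ?? "standard"];
}
```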

The result is that a full pipeline run — requirements through execution — costs a fraction of what it would if every agent used the same premium model. The right model for the right job, at the right price.


Human-in-the-Loop Controls

AGENT TEAMS is designed to amplify human capability, not replace human judgment. Several mechanisms keep you in control:

Plan approval gate. After the audit-refine loop, the pipeline can pause for your review. You can approve the plan, provide feedback that gets incorporated into a refinement pass, or reject it entirely.

Task-level interaction modes. Each task in the plan can be marked as autonomous (runs without intervention), assisted (pauses at key decision points), or human-required (pre-flagged as a blocker that needs your input before the pipeline continues).

Structured blocker resolution. When the pipeline pauses, you get a clear, machine-readable description of what's needed: blocker type, summary, details, and instructions for resolution. Fix the issue, mark it resolved, and re-run — the pipeline picks up right where it stopped.


What Makes This the Next Step?

I want to be specific about what AGENT TEAMS adds, and what it doesn't change.

What it doesn't change: The underlying AI models are the same ones available to any tool. The quality of any individual agent response is bounded by the model's capability. AGENT TEAMS doesn't make models smarter.

What it adds:

  • Structure. A defined pipeline with stages, validation gates, and typed outputs — instead of an open-ended conversation.
  • Specialization. Domain-specific agents with tailored system prompts, tool access, and model assignments — instead of one generalist handling everything.
  • Persistence. File-based state that survives crashes, supports resume, and creates a full audit trail — instead of session-bound context.
  • Quality loops. An adversarial audit phase that catches planning errors before they become execution errors — instead of hope.
  • Cost efficiency. Tiered model routing that puts expensive models only where they matter — instead of uniform pricing across all tasks.
  • Accountability. A permission audit log and structured blocker protocol — instead of trusting that everything went fine during an unattended run.
  • Learning. Cross-run history that enables pattern analysis and system improvement — instead of every run starting from zero.

The analogy I keep coming back to: existing AI coding tools give you a world-class individual contributor. AGENT TEAMS gives you the project management infrastructure to coordinate a team of them.


Key Takeaways

AGENT TEAMS represents a shift in how I think about AI-assisted work — from "AI as a conversational partner" to "AI as a managed team."

  1. Specialization compounds. When each agent focuses on one domain with a model optimized for that domain, the aggregate quality and cost efficiency improve dramatically.

  2. Plans should be stress-tested before execution. An adversarial audit with a reasoning model catches problems that no amount of careful prompting can prevent.

  3. State must survive anything. Atomic writes, graceful shutdown, and file-based persistence mean the system recovers from crashes, disconnections, and rate limits without losing work.

  4. Humans belong in the loop at decision points, not execution points. Approve plans, resolve blockers, review deliverables — don't babysit terminal output.

  5. Cost optimization is a first-class architectural concern. Across dozens of agent invocations, tiered model routing is the difference between a project costing $5 and a project costing $200.

Existing AI coding tools built the foundation. AGENT TEAMS is the next layer — the coordination, persistence, and quality infrastructure that turns individual brilliance into team execution.


This is a personal project under active development. If you're interested in multi-agent orchestration, structured AI pipelines, or just want to see how the pieces fit together, check out the blog. Contributions and feedback are welcome.
