Building Production-Ready AI Agent Orchestration: 30+ Issues, Hard-Won Lessons
The Problem: When AI Coding Assistants Go Rogue
AI assistants write code, debug, architect systems. They also create chaos.
I spent three months building an AI agent orchestration system. Not a demo. A production system that completed 30+ GitHub issues, averaging 1-2 days per issue, with 7 issues in the last 7 days.
The problem: AI assistants have no memory. They repeat mistakes. Skip validation. Create PRs without documenting learnings. Context-switch and lose everything.
Worse: they can't manage complex workflows. An issue isn't just "write code"—it's spec, decomposition, implementation, validation, knowledge extraction, PR, merge, cleanup. Skip knowledge extraction and you're building on sand.
The challenge: Build something reliable. Enforce workflow discipline. Capture learning. Recover from crashes. Operate autonomously.
This is how I built it.
Two Core Innovations
1. Phase-Based State Machine: Making the Implicit Explicit
Treat AI workflows like state machines with enforced transitions.
AI agents optimize for task completion. Ask them to "extract learnings before merging"—they skip it 60% of the time. Not malicious. Just optional overhead.
The solution isn't better prompts. Make bad paths impossible.
I designed a five-phase state machine where each phase has:
- One specialized agent with a single responsibility
- Explicit entry/exit conditions written to XML state logs
- Enforcement gates that block progression without completion
DISCUSSION → EXECUTE → VALIDATE → COMPACT → COMPLETE
     ↓          ↓          ↓         ↓         ↓
spec-agent   implement validator  compact-  merge/
             workers              agent     cleanup
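A minimal sketch of the transition gate, in Python for illustration. The phase names come from the diagram above; the function name, marker-file convention, and language choice are my assumptions, not the system's actual code.
# Phase gate (illustrative sketch, not the actual implementation)
from enum import Enum
from pathlib import Path

class Phase(Enum):
    DISCUSSION = "discussion"
    EXECUTE = "execute"
    VALIDATE = "validate"
    COMPACT = "compact"
    COMPLETE = "complete"

# Each phase may only advance to the next one
NEXT = {
    Phase.DISCUSSION: Phase.EXECUTE,
    Phase.EXECUTE: Phase.VALIDATE,
    Phase.VALIDATE: Phase.COMPACT,
    Phase.COMPACT: Phase.COMPLETE,
}

def advance(current: Phase, exit_marker: Path) -> Phase:
    # Block progression unless the current phase wrote its completion marker
    if not exit_marker.exists():
        raise RuntimeError(f"Gate blocked: {current.value} has no completion marker")
    return NEXT[current]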
Compaction is required, not suggested.
Early versions asked agents to "capture learnings." They skipped it. Current version: pre-merge hooks check for compaction flags. No flag? No PR. Period.
# Pre-merge hook (simplified)
if [ ! -f ".context/compaction-complete.flag" ]; then
    echo "ERROR: Must run compaction before creating PR"
    exit 1
fi
Result: 100% compaction rate. Zero learning loss.
Why this matters: Agents lose information between phases. Error patterns, spec divergences, architectural decisions—ephemeral unless captured. Mandatory compaction means learning accumulates automatically.
Each worktree gets isolated state in .context/session-log.xml. Phase transitions write explicit markers:
<phase name="compact" status="complete" timestamp="2025-10-20T15:30:45Z">
    <patterns-extracted>12</patterns-extracted>
    <session-archived>true</session-archived>
</phase>
The orchestrator reads this state before every action. No guessing, no inference—just "what does the XML say?"
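Reading that state back is mechanical. A minimal sketch in Python using the standard-library XML parser, assuming the <phase> elements shown above; the helper name is mine, not the system's.
# Orchestrator state check (illustrative sketch)
import xml.etree.ElementTree as ET

def phase_status(log_path: str, phase_name: str) -> str:
    # Return the recorded status of a phase, or "pending" if it was never written
    root = ET.parse(log_path).getroot()
    for phase in root.iter("phase"):
        if phase.get("name") == phase_name:
            return phase.get("status", "pending")
    return "pending"

# e.g. refuse to open a PR unless compaction is recorded as complete:
# phase_status(".context/session-log.xml", "compact") == "complete"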
2. Idempotent Operations: Production Reliability Through Retry-Safety
Every operation must be safely retryable.
Agents crash. Models hit rate limits. Networks fail. Context windows overflow. In production, these aren't edge cases. They're Tuesday afternoon.
Implementation phase example: Agent is writing code, crashes after completing 3 of 5 tasks. Re-run the phase command. It reads the XML, sees tasks 1-3 are marked complete, continues from task 4. Same inputs → same outputs.
Spec creation example: Agent drafts spec, crashes before finalizing. Re-run. It reads the partial spec from disk, continues refining. No duplicate work, no lost progress.
The pattern for every phase (sketched in code after this list):
- Read state from .context/session-log.xml
- Check completion markers
- Perform remaining work
- Write state atomically
- Set completion flag
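A sketch of that loop, again in Python. The per-task markers and helper names are illustrative; only the session-log path comes from the system described above.
# Idempotent phase runner (illustrative sketch)
import os
import xml.etree.ElementTree as ET
from typing import Callable

LOG = ".context/session-log.xml"

def run_phase(tasks: dict[str, Callable[[], None]]) -> None:
    # Re-runnable: tasks already marked complete are skipped on retry
    root = ET.parse(LOG).getroot() if os.path.exists(LOG) else ET.Element("session")
    done = {t.get("id") for t in root.iter("task") if t.get("status") == "complete"}
    for task_id, work in tasks.items():
        if task_id in done:
            continue
        work()
        ET.SubElement(root, "task", id=task_id, status="complete")
        write_atomically(root)

def write_atomically(root):
    # Write to a temp file, then rename, so a crash never leaves a half-written log
    tmp = LOG + ".tmp"
    ET.ElementTree(root).write(tmp)
    os.replace(tmp, LOG)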
Across 30+ issues: multiple crashes. Every one recovered by re-running the phase command. No manual repair. No corruption. No lost work.
Crash at any point? Next run reads the XML and continues.
Real-World Results
30+ issues completed. 1-2 day average. 100% compaction rate. Multiple crashes. Zero data loss.
Peak: 9 issues in 6 days (Oct 1-6).
The system orchestrated its own development.
What Actually Works
State machine enforcement: Zero skipped compactions. The gates work.
Idempotent operations: Multiple agent crashes across 30+ issues. Every single one recovered cleanly by re-running the phase command.
Worktree isolation: Ran 3 issues in parallel during peak sprint. Zero cross-contamination.
Organizational learning: After 30+ issues, complete session archives are preserved in .orchestration/archived-sessions/ for post-mortem analysis and future cross-issue pattern detection.
What's Still Hard
Ambiguous specs kill velocity: Vague acceptance criteria → agent thrashing. System enforces spec finalization but can't force clarity. That's human judgment.
Validation must be incremental: Today it runs as a discrete phase after implementation. Better: validate each task as it completes. The architecture permits this. It doesn't enforce it yet.
Merge conflicts need automation: Manual rebases required when branches diverge. Rebase-agent proposed (issue #81). Not implemented.
Hard-Won Lessons
Enforcement beats prompts: "Please capture learnings" → 60% skip rate. Block merges without compaction → 100% success.
Observability is mandatory: XML logs for every transition. When things break, you need the audit trail.
Idempotency from day one: Network failures, rate limits, context overflows happen regularly. Design for retries from the start.
State machines need explicit markers: Don't make orchestrators guess. Require explicit completion markers. No inference.
Why This Matters
Agentic systems can be production-ready today. Not demos. 30+ issues prove it.
The key: engineer for reliability, not capability. Don't prompt "be careful." Build systems that make failure impossible.
After 30+ issues, three things made the difference:
- Phase-based state machines with enforcement
- Idempotent operations
- Mandatory learning extraction
This isn't the final answer. It's proof that agentic systems work when we engineer them properly.