Building Production-Ready AI Agent Orchestration: 30+ Issues, Hard-Won Lessons
The Problem: When AI Coding Assistants Go Rogue
AI assistants write code, debug, architect systems. They also create chaos.
I spent three months building an AI agent orchestration system. Not a demo. A production system that completed 30+ GitHub issues, averaging 1-2 days per issue, with 7 issues in the last 7 days.
The problem: AI assistants have no memory. They repeat mistakes. Skip validation. Create PRs without documenting learnings. Context-switch and lose everything.
Worse: they can't manage complex workflows. An issue isn't just "write code"—it's spec, decomposition, implementation, validation, knowledge extraction, PR, merge, cleanup. Skip knowledge extraction and you're building on sand.
The challenge: Build something reliable. Enforce workflow discipline. Capture learning. Recover from crashes. Operate autonomously.
This is how I built it.
Two Core Innovations
1. Phase-Based State Machine: Making the Implicit Explicit
Treat AI workflows like state machines with enforced transitions.
AI agents optimize for task completion. Ask them to "extract learnings before merging"—they skip it 60% of the time. Not malicious. Just optional overhead.
The solution isn't better prompts. Make bad paths impossible.
I designed a five-phase state machine where each phase has:
- One specialized agent with a single responsibility
- Explicit entry/exit conditions written to XML state logs
- Enforcement gates that block progression without completion
DISCUSSION → EXECUTE → VALIDATE → COMPACT → COMPLETE
     ↓          ↓          ↓         ↓         ↓
spec-agent   implement validator  compact-  merge/
             workers              agent     cleanup
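A minimal sketch of the transition gate, in Python for illustration. The phase names come from the diagram above; the function name, marker-file convention, and language choice are my assumptions, not the system's actual code.
# Phase gate (illustrative sketch, not the actual implementation)
from enum import Enum
from pathlib import Path

class Phase(Enum):
    DISCUSSION = "discussion"
    EXECUTE = "execute"
    VALIDATE = "validate"
    COMPACT = "compact"
    COMPLETE = "complete"

# Each phase may only advance to the next one
NEXT = {
    Phase.DISCUSSION: Phase.EXECUTE,
    Phase.EXECUTE: Phase.VALIDATE,
    Phase.VALIDATE: Phase.COMPACT,
    Phase.COMPACT: Phase.COMPLETE,
}

def advance(current: Phase, exit_marker: Path) -> Phase:
    # Block progression unless the current phase wrote its completion marker
    if not exit_marker.exists():
        raise RuntimeError(f"Gate blocked: {current.value} has no completion marker")
    return NEXT[current]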
Compaction is required, not suggested.
Early versions asked agents to "capture learnings." They skipped it. Current version: pre-merge hooks check for compaction flags. No flag? No PR. Period.
# Pre-merge hook (simplified)
if [ ! -f ".context/compaction-complete.flag" ]; then
    echo "ERROR: Must run compaction before creating PR"
    exit 1
fi
Result: 100% compaction rate. Zero learning loss.
Why this matters: Agents lose information between phases. Error patterns, spec divergences, architectural decisions—ephemeral unless captured. Mandatory compaction means learning accumulates automatically.
Each worktree gets isolated state in .context/session-log.xml. Phase transitions write explicit markers:
<phase name="compact" status="complete" timestamp="2025-10-20T15:30:45Z">
    <patterns-extracted>12</patterns-extracted>
    <session-archived>true</session-archived>
</phase>
The orchestrator reads this state before every action. No guessing, no inference—just "what does the XML say?"
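Reading that state back is mechanical. A minimal sketch in Python using the standard-library XML parser, assuming the <phase> elements shown above; the helper name is mine, not the system's.
# Orchestrator state check (illustrative sketch)
import xml.etree.ElementTree as ET

def phase_status(log_path: str, phase_name: str) -> str:
    # Return the recorded status of a phase, or "pending" if it was never written
    root = ET.parse(log_path).getroot()
    for phase in root.iter("phase"):
        if phase.get("name") == phase_name:
            return phase.get("status", "pending")
    return "pending"

# e.g. refuse to open a PR unless compaction is recorded as complete:
# phase_status(".context/session-log.xml", "compact") == "complete"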
2. Idempotent Operations: Production Reliability Through Retry-Safety
Every operation must be safely retryable.
Agents crash. Models hit rate limits. Networks fail. Context windows overflow. In production, these aren't edge cases. They're Tuesday afternoon.
Implementation phase example: Agent is writing code, crashes after completing 3 of 5 tasks. Re-run the phase command. It reads the XML, sees tasks 1-3 are marked complete, continues from task 4. Same inputs → same outputs.
Spec creation example: Agent drafts spec, crashes before finalizing. Re-run. It reads the partial spec from disk, continues refining. No duplicate work, no lost progress.
The pattern for every phase (sketched in code after this list):
- Read state from .context/session-log.xml
- Check completion markers
- Perform remaining work
- Write state atomically
- Set completion flag
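A sketch of that loop, again in Python. The per-task markers and helper names are illustrative; only the session-log path comes from the system described above.
# Idempotent phase runner (illustrative sketch)
import os
import xml.etree.ElementTree as ET
from typing import Callable

LOG = ".context/session-log.xml"

def run_phase(tasks: dict[str, Callable[[], None]]) -> None:
    # Re-runnable: tasks already marked complete are skipped on retry
    root = ET.parse(LOG).getroot() if os.path.exists(LOG) else ET.Element("session")
    done = {t.get("id") for t in root.iter("task") if t.get("status") == "complete"}
    for task_id, work in tasks.items():
        if task_id in done:
            continue
        work()
        ET.SubElement(root, "task", id=task_id, status="complete")
        write_atomically(root)

def write_atomically(root):
    # Write to a temp file, then rename, so a crash never leaves a half-written log
    tmp = LOG + ".tmp"
    ET.ElementTree(root).write(tmp)
    os.replace(tmp, LOG)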
Across 30+ issues: multiple crashes. Every one recovered by re-running the phase command. No manual repair. No corruption. No lost work.
Crash at any point? Next run reads the XML and continues.
Real-World Results
30+ issues completed. 1-2 day average. 100% compaction rate. Multiple crashes. Zero data loss.
Peak: 9 issues in 6 days (Oct 1-6).
The system orchestrated its own development.
What Actually Works
State machine enforcement: Zero skipped compactions. The gates work.
Idempotent operations: Multiple agent crashes across 30+ issues. Every single one recovered cleanly by re-running the phase command.
Worktree isolation: Ran 3 issues in parallel during peak sprint. Zero cross-contamination.
Organizational learning: After 30+ issues, complete session archives are preserved in .orchestration/archived-sessions/ for post-mortem analysis and future cross-issue pattern detection.
What's Still Hard
Ambiguous specs kill velocity: Vague acceptance criteria → agent thrashing. System enforces spec finalization but can't force clarity. That's human judgment.
Validation must be incremental: Today it runs as a discrete phase after implementation. Better: validate each task as it completes. The architecture permits this. It doesn't enforce it yet.
Merge conflicts need automation: Manual rebases required when branches diverge. Rebase-agent proposed (issue #81). Not implemented.
Hard-Won Lessons
Enforcement beats prompts: "Please capture learnings" → 60% skip rate. Block merges without compaction → 100% success.
Observability is mandatory: XML logs for every transition. When things break, you need the audit trail.
Idempotency from day one: Network failures, rate limits, context overflows happen regularly. Design for retries from the start.
State machines need explicit markers: Don't make orchestrators guess. Require explicit completion markers. No inference.
Why This Matters
Agentic systems can be production-ready today. Not demos. 30+ issues prove it.
The key: engineer for reliability, not capability. Don't prompt "be careful." Build systems that make failure impossible.
After 30+ issues, three things made the difference:
- Phase-based state machines with enforcement
- Idempotent operations
- Mandatory learning extraction
This isn't the final answer. It's proof that agentic systems work when we engineer them properly.