Building a 48-Service Autonomous Ecosystem with Claude Code

I did not build a 48-service autonomous ecosystem because it would look impressive on a resume. I built it because I needed to answer a question that no benchmark can answer: do AI agents actually work when nobody is watching?

You can evaluate an agent on HumanEval or SWE-bench. You can measure token latency and tool-call accuracy in controlled environments. But none of that tells you whether your agent will correctly handle a metrics drift at 0300 on a Tuesday, recover from a cascading restart loop, or know when to stop changing things and escalate to a human. The only way to test sustained autonomous operation is to run it in sustained autonomous operation. So that is what I did.

Over the past several months, I have built and operated an ecosystem of 48 interconnected services—research agents, governance dashboards, security monitors, domain managers, build systems, and inference routers—all orchestrated primarily through Claude Code. This is what I learned about building agentic systems that actually work.

Architecture Decisions That Mattered

Agent-per-Script Over Frameworks

Early on, I evaluated LangChain, CrewAI, and AutoGen. Each offers abstractions that promise to simplify multi-agent coordination. Each also introduces layers of indirection that make debugging at 3 AM nearly impossible. When an agent fails inside a framework, you are debugging the framework's state machine, not your logic.

I chose a different path: each agent is a bash script with explicit tool permissions. No shared state between agents. No framework magic. No dependency chains that turn a single-point failure into a cascade. A research agent that runs at 0200 is a script that calls Claude's API with a specific system prompt, a defined set of MCP tools, and a circuit breaker that kills it after 12 consecutive failures. When it breaks, I read the script. When it works, I read the logs. There is no mystery layer in between.
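A minimal sketch of the consecutive-failure circuit breaker described above. The agents themselves are bash scripts; Python is used here purely for illustration, and `run_with_breaker`, `BreakerTripped`, and the `step` callable are hypothetical names rather than the ecosystem's actual code.

```python
from typing import Callable

MAX_CONSECUTIVE_FAILURES = 12  # the limit quoted in the text


class BreakerTripped(RuntimeError):
    """Raised when the agent is stopped after too many consecutive failures."""


def run_with_breaker(step: Callable[[], bool], max_attempts: int = 100,
                     limit: int = MAX_CONSECUTIVE_FAILURES) -> int:
    """Call `step` up to `max_attempts` times; trip after `limit` failures in a row.

    `step` returns True on success and False (or raises) on failure.
    Returns the number of successful steps if the breaker never trips.
    """
    consecutive = 0
    successes = 0
    for _ in range(max_attempts):
        try:
            ok = step()
        except Exception:
            ok = False  # an exception counts as a failed attempt
        if ok:
            successes += 1
            consecutive = 0  # any success resets the counter
        else:
            consecutive += 1
            if consecutive >= limit:
                raise BreakerTripped(f"{consecutive} consecutive failures")
    return successes
```

Counting consecutive rather than total failures means an agent that fails occasionally keeps running, while one stuck in a loop is stopped quickly.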

This is not elegant in the way framework authors define elegance. But it is debuggable, replaceable, and comprehensible at any hour. In production systems, those properties matter more than abstraction.

Tool-Use Over Prompt Chains

The ecosystem exposes 6 MCP servers with over 40 tools. Claude orchestrates work through structured tool calls—reading files, searching codebases, running commands, querying databases—rather than through multi-step prompt chains that pass context from one generation to the next.

This distinction matters more than it appears. A prompt chain is fragile because each step depends on the previous step's output being formatted correctly, being complete, and being relevant. One hallucinated intermediate result corrupts everything downstream. Tool calls, by contrast, ground the agent in reality at every step. The file system does not hallucinate. The grep results are either there or they are not. The git log is what it is.

When I designed the ecosystem's agent architecture, I optimized for maximum tool grounding and minimum unverified generation. Every agent has access to verification tools that let it check its own work before committing to an action. This is not about distrusting the model. It is about building systems where trust is verified continuously rather than assumed once.
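One way to picture "verify before committing" is a small wrapper that refuses to act on a generated result until an independent check accepts it. This is a sketch under assumptions: `propose_then_verify` and `VerificationFailed` are hypothetical names, and the in-memory `source` string stands in for a real file that a tool call would read.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class VerificationFailed(RuntimeError):
    """Raised when a proposed result does not survive the grounding check."""


def propose_then_verify(propose: Callable[[], T],
                        verify: Callable[[T], bool],
                        commit: Callable[[T], None]) -> T:
    """Commit a proposed result only after an independent check accepts it."""
    candidate = propose()
    if not verify(candidate):
        raise VerificationFailed(f"rejected candidate: {candidate!r}")
    commit(candidate)
    return candidate


# Example: the agent claims a function exists; the check looks at the real text.
source = "def load_config():\n    return {}\n"
committed: List[str] = []
propose_then_verify(
    propose=lambda: "load_config",
    verify=lambda name: f"def {name}(" in source,
    commit=committed.append,
)
```

A hallucinated name such as `load_settings` would raise `VerificationFailed` here, and nothing would be committed.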

Hybrid Routing with MuXD

Not every prompt needs Claude. This is an uncomfortable truth for anyone building on frontier models, but it is economically and architecturally important. My hybrid router, MuXD, uses a semantic classifier with per-task confidence thresholds to decide which model handles each request.

The thresholds are calibrated by task criticality: drafting and summarization route locally at a 0.50 confidence threshold, and code review requires 0.75. Architecture decisions and security analysis require 0.80 or higher and always route to Claude. The result is a 45% reduction in API costs without meaningful quality degradation on routine tasks.
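The routing rule can be sketched as a lookup plus a comparison. This is not MuXD's implementation. It assumes the classifier's confidence means "confidence that the local model can handle this request", the task-type keys are illustrative names, and sending unknown task types to Claude is an added assumption.

```python
# Thresholds quoted in the text; the task-type keys are illustrative names.
LOCAL_THRESHOLDS = {
    "drafting": 0.50,
    "summarization": 0.50,
    "code_review": 0.75,
}

# Per the text, these categories always go to Claude.
ALWAYS_CLAUDE = {"architecture", "security_analysis"}


def route(task_type: str, confidence: float) -> str:
    """Return "local" or "claude" for a classified request."""
    if task_type in ALWAYS_CLAUDE:
        return "claude"
    threshold = LOCAL_THRESHOLDS.get(task_type)
    if threshold is None:
        # Assumption: unknown task types fall back to the stronger model.
        return "claude"
    return "local" if confidence >= threshold else "claude"
```

For example, `route("drafting", 0.55)` returns `"local"`, while `route("code_review", 0.60)` falls below the 0.75 bar and returns `"claude"`.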

The architectural insight here is not about saving money. It is about appropriate allocation of capability. A system that routes everything to the most powerful model available is not well-engineered—it is lazy. Good systems engineering means matching capability to requirement, whether you are sizing a power supply or choosing an inference endpoint.

What Breaks and Why

The 12-Crash Circuit Breaker

Three weeks into autonomous operation, a research agent entered a restart loop. The trigger was prosaic: a website it regularly scraped changed its HTML structure. The agent's tool call returned unexpected content, it retried with a modified query, got the same result, and retried again. It did this twelve times in four minutes before the circuit breaker killed it.

Without the circuit breaker, it would have burned through API credits indefinitely. But the more interesting lesson was what happened after the kill. The agent had a recovery protocol: log the failure context, notify the governance dashboard, and queue the task for human review. When I checked at 0600, there was a clear trail showing exactly what went wrong, when the breaker tripped, and what work remained undone.

This is what I mean by designing for failure rather than designing against it. The agent did fail. The system did not.
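The three-step recovery protocol (log, notify, queue) might look like the sketch below. Everything here is hypothetical: the `FailureRecord` fields are guesses at what "failure context" would contain, and plain lists stand in for the real log file, governance dashboard, and review queue.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class FailureRecord:
    agent: str
    reason: str
    attempts: int
    pending_tasks: List[str]
    tripped_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def handle_breaker_trip(record: FailureRecord, log_lines: List[str],
                        notifications: List[str], review_queue: List[str]) -> None:
    """Apply the three recovery steps from the text, in order."""
    log_lines.append(json.dumps(asdict(record)))              # 1. log the failure context
    notifications.append(f"{record.agent}: breaker tripped")  # 2. notify the dashboard
    review_queue.extend(record.pending_tasks)                 # 3. queue work for human review
```

Recording the unfinished tasks alongside the failure reason is what makes the morning-after trail readable: it shows what broke and what still needs doing.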

Metrics Integrity Watchdog

One of the subtler failure modes in autonomous systems is metric drift. An agent reports completing a task and increments a counter, but the task was only partially complete. Or a health check passes because it tests connectivity but not correctness. Over days and weeks, small inaccuracies compound into a dashboard that says everything is fine while reality diverges.

I built a dedicated watchdog agent whose sole job is verifying that reported metrics match ground truth. It does not trust the health check—it independently verifies the state that the health check claims to measure. When it finds a discrepancy, even a single-count drift, it flags it immediately rather than waiting for the error to compound.

The principle is recursive verification: you need to verify the verifier. Health checks are necessary but not sufficient. Any metric that drives autonomous decision-making needs an independent confirmation path that does not share failure modes with the primary measurement.
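At its core, the watchdog's comparison is a zero-tolerance diff between what was reported and what was independently measured. This is a sketch with hypothetical names (`find_drift`, the metric keys). It assumes the ground-truth values were gathered through a separate path, such as counting artifacts on disk, rather than read from the same counters.

```python
from typing import Dict, List, Optional, Tuple

Drift = Tuple[str, Optional[int], Optional[int]]


def find_drift(reported: Dict[str, int], ground_truth: Dict[str, int]) -> List[Drift]:
    """Return (metric, reported, actual) for every metric that disagrees.

    A metric missing from either side counts as drift, and no tolerance is
    applied, so even a single-count difference is flagged.
    """
    drift: List[Drift] = []
    for name in sorted(set(reported) | set(ground_truth)):
        claimed = reported.get(name)
        actual = ground_truth.get(name)
        if claimed != actual:
            drift.append((name, claimed, actual))
    return drift
```

With `reported = {"tasks_completed": 41}` and `ground_truth = {"tasks_completed": 40}`, the function returns `[("tasks_completed", 41, 40)]`.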

Why Health Checks Are Not Enough

A service can respond to a health check with HTTP 200 while serving stale data. A database can accept connections while its replication lag grows unbounded. A model endpoint can return valid JSON while its response times have degraded because the GPU is thermally throttling.

In my ecosystem, each service has three layers of verification: a liveness check (is the process running), a readiness check (can it accept work), and a correctness check (is its output actually right). The correctness checks are the expensive ones—they require running known-good inputs through the service and comparing outputs against expected results. But they are the only ones that catch the failures that matter.
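The three layers can be expressed as an ordered sequence of checks, with the correctness layer driven by known-good input and output pairs. This is an illustrative sketch: `ServiceChecks`, `verify_service`, and the `golden` mapping are hypothetical names, and a real service would be called over the network rather than as a local function.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ServiceChecks:
    liveness: Callable[[], bool]    # is the process running?
    readiness: Callable[[], bool]   # can it accept work?
    handler: Callable[[str], str]   # the service behavior under test
    golden: Dict[str, str]          # known-good input -> expected output


def verify_service(checks: ServiceChecks) -> str:
    """Return the name of the first failing layer, or "ok" if all three pass."""
    if not checks.liveness():
        return "liveness"
    if not checks.readiness():
        return "readiness"
    for given, expected in checks.golden.items():
        if checks.handler(given) != expected:
            return "correctness"
    return "ok"
```

A service that is up and accepting work but returns `"v1"` where `"v2"` is expected passes the first two layers and fails only the third, which is exactly the stale-data case a plain health check misses.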

The Governance Gap: COMET

Most conversations about AI agents focus on capability: can the agent do this task? This is the wrong first question. The right first question is: should the agent do this task, and with what level of human oversight?

This gap—between what an agent can do and what it should be allowed to do—is what led me to build COMET, a 7-step AI governance framework that treats the delegation decision as a first-class engineering problem rather than an afterthought.

The workflow starts with role discovery: identifying every task in a business function and classifying it across five delegation levels, from fully human to fully autonomous. Each task gets scored against risk criteria, compliance requirements, and organizational readiness. The output is a RACI matrix where AI is an explicit participant with defined responsibilities and defined boundaries.

COMET cross-references over 20 compliance frameworks automatically—NIST AI RMF, ISO 42001, SOC 2, CMMC, and domain-specific standards. When you assign a task to autonomous operation, COMET tells you which compliance controls you need to satisfy and what audit trail you need to maintain. This is not theoretical governance. It is operational governance that runs alongside the agents themselves.

The insight that drove COMET's design is simple: governance cannot be a gate that happens before deployment. It has to be a continuous function that runs during operation. An agent's delegation level should be dynamic—earned through demonstrated reliability and revoked when performance degrades. COMET implements this through continuous monitoring of agent performance against the criteria that justified their delegation level in the first place.
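A dynamic delegation level can be sketched as a five-step ladder that an agent climbs or descends based on recent reliability. This is not COMET's logic. Only the two endpoint levels come from the text; the three middle level names and both thresholds are placeholders chosen for illustration.

```python
from enum import IntEnum


class Delegation(IntEnum):
    # Only the two endpoints come from the text; the middle names are placeholders.
    FULLY_HUMAN = 0
    AI_ASSISTED = 1
    HUMAN_APPROVES = 2
    HUMAN_MONITORS = 3
    FULLY_AUTONOMOUS = 4


def adjust_level(level: Delegation, success_rate: float,
                 promote_at: float = 0.99,
                 demote_below: float = 0.95) -> Delegation:
    """Move one step up or down the ladder based on recent reliability.

    Both thresholds are illustrative assumptions, not COMET's actual criteria.
    """
    if success_rate < demote_below and level > Delegation.FULLY_HUMAN:
        return Delegation(level - 1)  # revoke autonomy when performance degrades
    if success_rate >= promote_at and level < Delegation.FULLY_AUTONOMOUS:
        return Delegation(level + 1)  # earn autonomy through demonstrated reliability
    return level
```

Moving one step at a time keeps a single bad or good measurement window from swinging an agent between the extremes.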

Claude Code as a Build Partner

I want to be honest about the partnership because intellectual honesty matters more than impressive claims. What would take a team of engineers months to design, implement, and integrate, Claude Code and I built in weeks. That is a real and significant capability multiplier. But it requires understanding where the multiplier applies and where it does not.

Where Claude Code Excels

Architecture planning. When I describe a system's requirements, constraints, and failure modes, Claude Code helps me explore the solution space faster than I could alone. It proposes architectures, identifies edge cases I have not considered, and stress-tests designs through adversarial questioning. The quality of this collaboration depends entirely on the quality of the constraints I provide—garbage in, garbage out still applies.

Boilerplate generation. A new service needs a health check endpoint, a configuration loader, error handling, logging setup, and integration with the ecosystem's monitoring. Claude Code produces this reliably and consistently. This is not trivial: it frees me to focus on the logic that is actually unique to each service rather than re-implementing infrastructure patterns for the forty-eighth time.

Parallel exploration. When I am uncertain which approach is correct, I can have Claude Code prototype two or three alternatives in the time it would take me to implement one. This changes the economics of technical decision-making. Instead of committing to an approach based on intuition and living with it, I can evaluate concrete implementations before choosing.

Where It Needs Guardrails

Long-horizon refactors. Claude Code operates within a context window. When a refactor touches thirty files across five services and requires maintaining consistency across all of them, the agent needs careful session management and explicit state tracking. It does not naturally remember that it renamed an interface three sessions ago and that six downstream consumers need updating. I manage this with session-scoped checklists and explicit verification steps.
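A session-scoped checklist can be as simple as a small record that is saved to disk at the end of one session and reloaded at the start of the next. This is a hypothetical sketch. The `RefactorChecklist` class, its fields, and the file paths in the example are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class RefactorChecklist:
    change: str                                               # what was changed
    consumers: Dict[str, bool] = field(default_factory=dict)  # path -> updated yet?

    def mark_done(self, path: str) -> None:
        if path not in self.consumers:
            raise KeyError(f"{path} is not on the checklist")
        self.consumers[path] = True

    def remaining(self) -> List[str]:
        """Paths that still need updating, in a stable order."""
        return sorted(p for p, done in self.consumers.items() if not done)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "RefactorChecklist":
        return cls(**json.loads(text))


checklist = RefactorChecklist(
    "rename Fetcher to SourceClient",
    {"svc_a/client.py": False, "svc_b/worker.py": False},
)
checklist.mark_done("svc_a/client.py")
restored = RefactorChecklist.from_json(checklist.to_json())  # next session
```

After the round trip, `restored.remaining()` still lists `svc_b/worker.py`, so a fresh session knows what is left without relying on memory.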

State management across sessions. Each Claude Code session starts fresh. It does not remember the architectural decisions from yesterday's session unless I encode them in documentation or configuration. This is actually a feature—it prevents stale context from poisoning new work—but it means the human needs to be the continuity layer. I maintain architecture decision records specifically so that each new session can be grounded in prior decisions without inheriting prior confusion.

Knowing when NOT to change something. This is the subtlest guardrail. Claude Code, like any capable agent, has a bias toward action. When asked to improve something, it will improve it. But sometimes the correct action is to leave working code alone, even if it could be marginally better. I have learned to be explicit about scope boundaries: "fix this bug, do not refactor the surrounding code" is a necessary instruction that I now include reflexively.

Practitioner's Intuition

After hundreds of sessions building this ecosystem, I have developed what I can only call practitioner's intuition about agent capabilities. I know which tasks to express as a single prompt and which to decompose into a sequence of smaller instructions. I know when to let the agent explore freely and when to constrain it to a specific approach. I know the difference between a prompt that will produce reliable results across runs and one that will produce variable results that require manual verification.

This intuition is not something you can learn from documentation. It comes from operating at the boundary between what works reliably and what works sometimes. It comes from debugging the failures that happen when you push past that boundary. And it is, I believe, one of the most valuable skills for anyone building agentic systems at scale.

The Thesis

Agent infrastructure needs systems engineers, not just ML researchers.

The hard problems in agentic AI are not primarily about model capability. The models are remarkably capable. The hard problems are about everything that surrounds the model: failure detection, graceful degradation, state management, operational monitoring, governance, and the discipline to build systems that work reliably rather than systems that work impressively in demos.

These are systems engineering problems. They require thinking about failure modes before they occur, designing recovery paths for every critical function, building verification systems that do not share failure modes with the things they verify, and maintaining operational discipline when the system is running autonomously at 0300 and nobody is watching.

That is what 25 years in defense systems engineering teaches you. Not how to build the most sophisticated system, but how to build systems that do not fail when failure has consequences. The transfer from designing fault-tolerant avionics to designing fault-tolerant agent architectures is more direct than it might appear. In both cases, the engineering discipline is the same: understand your failure modes, design your recovery paths, verify your verification, and never assume that something works just because it worked yesterday.

The 48-service ecosystem I built is not impressive because of its size. It is useful because it runs. Continuously, autonomously, and correctly—even at 3 AM on a Tuesday when nobody is watching. That is the standard I believe agent infrastructure should be held to, and it is the standard I intend to bring to building the next generation of agentic systems.