From Monolith to 6 Domains: Decomposing a Cyber-Physical RE Platform into Independently Deployable Services
Eighteen months ago, ARKONA was a single Python script that scraped firmware headers and dumped them into a SQLite file. Today it is 47 services across 23 ports, 6 sovereign domains, and a battle rhythm of 26 autonomous agents — all running on Tailscale HTTPS with SHA-256 provenance signing on every artifact. That evolution was not planned. It was earned.
This is the story of how a monolithic reverse engineering tool became a decomposed, domain-driven ecosystem — and the architectural decisions that made it survivable.
The Original Sin: Coupling Everything to the RE Pipeline
The first version of ARKONA had one job: feed firmware blobs into a Ghidra headless analysis pipeline (CIPHER) and produce structured reports. Authentication, logging, scheduling, and output rendering all lived in the same process. This was fine until I needed to add a risk scoring layer grounded in NIST 800-30. Suddenly the "firmware analysis script" needed to know about threat actors, asset criticality, and mission impact — concepts that had nothing to do with parsing ELF headers.
The natural response was to bolt on more classes. Within weeks I had a 4,000-line main.py that did everything and could be safely touched by no one, including me. The first decomposition was not strategic — it was self-defense.
The Domain Model: Six Concerns, Six Boundaries
The insight that unlocked the architecture came from a MITRE ATT&CK framing exercise. When I mapped every function ARKONA was performing against the question "what adversarial or operational concern does this serve?", six clusters emerged with almost no overlap:
- CoreOps — cyber-physical systems RE, CIPHER pipeline, Ghidra integration, hardware artifact ingestion
- REOps — reverse engineering workflows, binary diffing, CVE correlation, firmware provenance
- DevOps — the software factory itself: CI runners, dependency scanning, deploy orchestration
- BizOps — contracts, proposals, client management, financial reporting
- COMET — AI governance, human↔AI task delegation framework, audit trails
- Rango — personal productivity, interview prep, research agents, calendar synthesis
Each domain got its own port range, its own internal state, and — critically — its own failure budget. A crashed BizOps invoice renderer should never take down CIPHER. That isolation requirement drove every subsequent technical decision.
Port Allocation as an Architectural Contract
Port numbers are documentation. I assigned ranges deliberately so that any engineer (or agent) scanning the network could immediately infer domain membership from a port number alone:
```
# /etc/arkona/ports.conf — domain port assignments
# CoreOps: 8000-8009 (CIPHER API: 8000, provenance: 8001, hw-ingest: 8002)
# REOps:   8010-8019 (firmware-db: 8010, cve-broker: 8011, diff-engine: 8012)
# DevOps:  8020-8029 (factory-api: 8020, scan-worker: 8021, deploy-ctrl: 8022)
# BizOps:  8030-8039 (contracts: 8030, invoices: 8031, crm: 8032)
# COMET:   8040-8049 (governance-api: 8040, audit-log: 8041, delegation: 8042)
# Rango:   8050-8059 (interview-prep: 8050, research: 8051, calendar: 8052)
# Shared:  8080, 8443 (domain-manager, auth gateway)
```
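Because the ranges are contiguous and non-overlapping, domain membership can be recovered from a port number with a trivial lookup. Here is a minimal sketch of that inference; the range table mirrors ports.conf above, but the helper itself is illustrative, not part of ARKONA:

```python
# Hypothetical helper: infer ARKONA domain membership from a port number.
# Ranges mirror ports.conf; this is a sketch, not the real registry code.
DOMAIN_RANGES = {
    "coreops": range(8000, 8010),
    "reops":   range(8010, 8020),
    "devops":  range(8020, 8030),
    "bizops":  range(8030, 8040),
    "comet":   range(8040, 8050),
    "rango":   range(8050, 8060),
}
SHARED_PORTS = {8080, 8443}  # domain-manager, auth gateway

def domain_for_port(port: int) -> str:
    """Return the domain that owns a port, or 'shared'/'unknown'."""
    if port in SHARED_PORTS:
        return "shared"
    for domain, ports in DOMAIN_RANGES.items():
        if port in ports:
            return domain
    return "unknown"
```

This is what "port numbers are documentation" buys: an agent scanning the network needs no lookup service to classify what it finds.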
The Domain Manager (port 8443, the ecosystem's front door) maintains a live registry of all 47 services. When a service comes online it self-registers; when it crashes the Domain Manager marks it degraded and the alerting agent fires within 90 seconds. Right now, 21 of 22 monitored services are healthy — the one outlier is a BizOps worker doing a scheduled schema migration.
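The register / heartbeat / mark-degraded lifecycle can be sketched in a few lines. The real Domain Manager is a network service; this in-memory model only illustrates the mechanics, and the names and 90-second timeout shape are assumptions drawn from the description above:

```python
import time
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 90.0  # seconds of silence before a service is flagged

@dataclass
class ServiceRecord:
    name: str
    port: int
    last_heartbeat: float = field(default_factory=time.monotonic)
    status: str = "healthy"

class DomainManager:
    def __init__(self) -> None:
        self.registry: dict[str, ServiceRecord] = {}

    def register(self, name: str, port: int) -> None:
        # Services self-register when they come online.
        self.registry[name] = ServiceRecord(name, port)

    def heartbeat(self, name: str) -> None:
        rec = self.registry[name]
        rec.last_heartbeat = time.monotonic()
        rec.status = "healthy"

    def sweep(self) -> list[str]:
        # Mark silent services degraded; return the newly degraded names
        # so an alerting agent can fire on them.
        now = time.monotonic()
        degraded = []
        for rec in self.registry.values():
            if rec.status == "healthy" and now - rec.last_heartbeat > HEARTBEAT_TIMEOUT:
                rec.status = "degraded"
                degraded.append(rec.name)
        return degraded
```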
The Decomposition Strategy: Strangle, Don't Rewrite
I used the strangler fig pattern, not a big-bang rewrite. Each domain started as a thin FastAPI wrapper around the relevant subset of main.py logic, sharing the original SQLite database. Over successive iterations, each domain got its own database, its own schema migrations, and its own internal event bus.
The inter-domain communication layer — the agent-comm broker — implements a pub/sub model where domains announce events on typed topics and subscribers react asynchronously. NIST SP 800-204 (microservices security) shaped the authentication model: every cross-domain call carries a signed JWT with domain issuer, not a shared secret. This meant that compromising one domain's credentials does not cascade.
```python
# agent-comm: cross-domain event dispatch (simplified)
import hashlib
import json
from uuid import uuid4

from pydantic import BaseModel, Field

class DomainEvent(BaseModel):
    event_id: str = Field(default_factory=lambda: str(uuid4()))
    source_domain: str  # e.g. "reops"
    topic: str          # e.g. "firmware.analyzed"
    payload: dict
    sha256: str = ""    # provenance hash of payload, set at publish time

async def publish(event: DomainEvent, broker: BrokerClient):
    # Compute the provenance hash before publish — every artifact carries it
    event.sha256 = hashlib.sha256(
        json.dumps(event.payload, sort_keys=True).encode()
    ).hexdigest()
    await broker.publish(f"arkona.{event.source_domain}.{event.topic}", event)
```
Provenance hashing at the event level — not just at the artifact level — means I can reconstruct the exact chain of custody for any analysis result. For a platform that produces findings with legal and operational weight, this is not optional.
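The consumer side of that contract is a recomputation check. A minimal sketch, assuming the same canonical JSON serialization as the publisher (`sort_keys=True` over the payload):

```python
import hashlib
import hmac
import json

def verify_provenance(payload: dict, claimed_sha256: str) -> bool:
    # Recompute SHA-256 over the canonical JSON form of the payload and
    # compare against the hash the event carried. A mismatch means the
    # payload changed somewhere between publish and consumption.
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return hmac.compare_digest(digest, claimed_sha256)
```

`hmac.compare_digest` is constant-time; for a bare integrity hash that is belt-and-braces rather than strictly necessary, but it costs nothing.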
MuXD: Routing Intelligence Across Domains
One unexpected benefit of decomposition was that it exposed the LLM consumption pattern of each domain clearly. CoreOps tasks (binary decompilation summaries, vulnerability narratives) need Claude's full reasoning capability. Rango tasks (calendar synthesis, interview flash cards) can run on a local Ollama model at zero token cost.
MuXD, the hybrid LLM router, sits in front of all agent traffic and classifies requests before they hit any model. It routes to Ollama (Mistral, DeepSeek, Llama3) for low-complexity work and to Claude for tasks requiring multi-step reasoning, code generation, or judgment calls. Token savings since deploying MuXD: substantial enough that the Anthropic API bill no longer scales linearly with agent count. Each domain registers its routing preferences at startup; MuXD enforces them at the network layer.
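The routing decision itself can be sketched as a small classifier. The topic names, model choices, and thresholds below are illustrative assumptions, not MuXD's actual policy:

```python
from dataclasses import dataclass

# Hypothetical topic sets; in the real system each domain registers
# its routing preferences with MuXD at startup.
LOCAL_TOPICS = {"calendar.synthesize", "interview.flashcards"}
REASONING_TOPICS = {"binary.decompile_summary", "vuln.narrative", "code.generate"}

@dataclass
class Route:
    backend: str  # "ollama" or "claude"
    model: str

def route_request(topic: str, needs_judgment: bool = False) -> Route:
    # Multi-step reasoning, codegen, or judgment calls go to Claude;
    # everything registered as low-complexity stays on local Ollama
    # models at zero token cost.
    if needs_judgment or topic in REASONING_TOPICS:
        return Route("claude", "claude-sonnet")
    if topic in LOCAL_TOPICS:
        return Route("ollama", "mistral")
    return Route("ollama", "llama3")  # default to the cheap local path
```

The structural point is that the router sits in front of every agent, so cost policy is enforced in one place instead of twenty-six.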
COMET: Governance as a First-Class Domain
The decision to make AI governance its own domain — not a library, not a middleware layer, but a sovereign service with its own API — was the most unusual architectural choice and the one I am most confident was correct.
COMET implements a 7-step human↔AI delegation framework grounded in IEEE 7010 (assessing the impact of autonomous and intelligent systems on human well-being), NIST AI RMF, and ISO/IEC 42001. Every agent in the ecosystem registers its decision boundary with COMET at initialization. When an agent wants to take an action that crosses its declared autonomy threshold — say, sending an external communication or modifying a production config — it must request delegation approval. COMET logs the request, evaluates it against the registered policy, and either approves, escalates to human review, or denies with a structured rationale.
Having this as a domain means the governance logic can evolve independently of the agents it governs. When I tightened the delegation policy for BizOps agents last month, no agent code changed — only COMET's policy engine did.
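The approve / escalate / deny decision described above reduces to a policy evaluation. A sketch under assumed names — the real COMET policy engine's schema and risk model are more involved than a single integer:

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    APPROVE = "approve"
    ESCALATE = "escalate"
    DENY = "deny"

@dataclass
class Policy:
    autonomy_threshold: int  # max risk an agent may act on autonomously
    escalation_ceiling: int  # above this, deny outright

def evaluate(policy: Policy, action_risk: int) -> tuple[Decision, str]:
    """Return a decision plus a structured rationale."""
    if action_risk <= policy.autonomy_threshold:
        return Decision.APPROVE, "within declared autonomy threshold"
    if action_risk <= policy.escalation_ceiling:
        return Decision.ESCALATE, "exceeds autonomy threshold; human review required"
    return Decision.DENY, "exceeds escalation ceiling for this agent"
```

Tightening a domain's policy means changing the `Policy` values registered for its agents; no agent code is involved, which is exactly what made last month's BizOps change a one-service deploy.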
What 235 Commits in Seven Days Looks Like
The current development pace — 235 commits in the last seven days — is only possible because of domain isolation. I can land a breaking change in the COMET delegation API, run its test suite, and deploy it to port 8040 without touching the 46 other services. The Domain Manager's parallel health checks (a recent optimization that dropped check latency from 4.3 seconds to 3 milliseconds by removing sequential blocking) confirm ecosystem health in near-real time after every deploy.
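The sequential-to-parallel health-check change is worth seeing in miniature: with N services, sequential probes cost the sum of their latencies, while fanning out with `asyncio.gather` costs roughly the slowest single probe. The probes below are simulated sleeps, not the real HTTP checks:

```python
import asyncio
import random

async def check_service(name: str) -> tuple[str, bool]:
    # Stand-in for an HTTP health probe against one service.
    await asyncio.sleep(random.uniform(0.001, 0.003))
    return name, True

async def check_all(services: list[str]) -> dict[str, bool]:
    # One concurrent task per service: total wall time is approximately
    # the max probe latency, not the sum of all of them.
    results = await asyncio.gather(*(check_service(s) for s in services))
    return dict(results)
```

With 47 services at ~100 ms per probe, the sequential version blocks for seconds; the concurrent version finishes in roughly one probe's time, which is where the 4.3 s → 3 ms class of improvement comes from.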
The dual Tesla P40 GPUs handle local inference across 5 Ollama models while the CPU handles everything else. With 440GB DDR4, no service has ever been killed by the OOM reaper. That headroom matters when you are running 26 agents concurrently during peak battle rhythm.
The Lesson
The hardest part of decomposing a monolith is not the technical work — it is resisting the temptation to decompose along code boundaries instead of concern boundaries. My original instinct was to split by data type: one service for firmware, one for reports, one for users. That would have produced a distributed monolith with network latency added as a gift.
Domain-driven decomposition forced me to ask a harder question: who cares about this capability, and what would they never want entangled with something else? The answer to that question — not the class diagram — is the architecture. Every port range, every event topic, every JWT issuer in ARKONA traces back to that question asked and answered honestly.
If you are building a platform that needs to survive its own growth, make your domains explicit before you write the first service. The cost of getting the boundaries wrong early is paid in every deployment, every incident, and every new feature for as long as the system lives.