Multi-agent communication patterns: pub/sub, task delegation, and the broker architecture behind 26 coordinated agents
At ARKONA, we've built an autonomous multi-agent AI ecosystem currently operating with 26 agents, covering domains from cyber-physical reverse engineering (CoreOps) to personal productivity (Rango). Managing communication and coordination among this many agents presents significant architectural challenges: it's not simply about getting messages from A to B, but about ensuring resilience, scalability, and a clear separation of concerns. I've found that a hybrid approach, combining publish-subscribe, task delegation, and a centralized broker, works best in practice.
Publish-Subscribe for Broad Awareness
Initially, we attempted direct peer-to-peer communication. While conceptually simple, it quickly became unwieldy. Each agent needing to know about events from others created a combinatorial explosion of connections. We adopted a publish-subscribe (pub/sub) pattern, which is foundational to our inter-agent communication. The core idea is that agents *publish* events to named topics, and other agents *subscribe* to the topics they're interested in. This decouples producers from consumers.
In ARKONA, this is implemented by a custom service called 'MCP Server' (Message Coordination Protocol), running on port 8081 over Tailscale HTTPS. The MCP Server isn't a full-fledged message queue like Kafka or RabbitMQ; we need low latency and a small footprint, which is crucial given that we're running on infrastructure with limited resources (dual Tesla P40 GPUs, 440GB DDR4). Instead, it's a lightweight broker built with Python and asyncio, using WebSockets for persistent connections. Agents maintain a WebSocket connection to the MCP Server and register their subscriptions. Here's a simplified example of how an agent might subscribe to a 'new_ghidra_analysis' topic:
# Python (simplified example)
import asyncio

import websockets

async def subscribe(topic):
    # WebSocket connections use the wss:// scheme (the server sits behind
    # the Tailscale HTTPS certificate, but the endpoint itself is wss).
    uri = "wss://mcp.tail85a379.ts.net:8081/subscribe"
    async with websockets.connect(uri) as websocket:
        await websocket.send(topic)
        print(f"Subscribed to topic: {topic}")
        while True:
            message = await websocket.recv()
            print(f"Received message on {topic}: {message}")

asyncio.run(subscribe('new_ghidra_analysis'))
The ‘CIPHER’ service (our hardware RE pipeline, integrating with Ghidra) publishes events to topics like 'new_ghidra_analysis', 'vulnerability_detected', and 'code_coverage_report'. The 'REOps' agents, responsible for reverse engineering task management, subscribe to these topics to monitor progress and prioritize analysis. The ‘COMET’ AI Governance service also subscribes to security-related topics for risk evaluation (grounded in NIST 800-30).
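On the broker side, dispatch reduces to a topic → subscribers map. Here is a minimal in-memory sketch of that core logic (illustrative only, not the actual MCP Server implementation, which speaks WebSockets over the network):

```python
from collections import defaultdict

class MiniBroker:
    """Toy topic -> subscriber dispatcher illustrating the pub/sub core."""

    def __init__(self):
        self._subs = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subs[topic].append(callback)

    def publish(self, topic, message):
        # Deliver to every subscriber of the topic; a topic with no
        # subscribers is a silent no-op here.
        for callback in self._subs[topic]:
            callback(message)

received = []
broker = MiniBroker()
broker.subscribe("new_ghidra_analysis", received.append)
broker.publish("new_ghidra_analysis", {"binary": "firmware.bin"})
broker.publish("code_coverage_report", {"coverage": 0.73})  # no subscribers
```

The key property is the same decoupling described above: the publisher never learns who, if anyone, consumed the event.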
Task Delegation: Beyond Simple Notifications
Pub/sub handles broadcasting information, but it's insufficient for complex interactions that require a response or acknowledgement. For these scenarios, we employ task delegation. An agent needing a task performed sends a request to another agent, including details and expected output. This is distinct from simple event notification. The requesting agent expects a response and potentially monitors progress.
The MCP Server facilitates task delegation by acting as a request router. An agent sends a request to the MCP Server, specifying the target agent and task parameters. The MCP Server then forwards the request to the designated agent. The response follows the same path back through the MCP Server. This allows for auditing and potentially, intelligent routing based on agent load or expertise.
For example, the 'DevOps' service (our software factory) might delegate a research task to one of our autonomous agents. That agent, operating on its battle rhythm, performs the research and returns the results to the DevOps service. The task specification is serialized as JSON and includes a unique task ID for tracking.
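Concretely, a delegation envelope needs little more than a generated task ID, a target, and parameters, with the response echoing the task ID back so the requester can correlate it. A sketch (the field names and agent name here are illustrative, not our actual wire format):

```python
import json
import uuid

def build_task_request(target_agent, task, params):
    # Serialize a delegation request; the unique task_id enables tracking.
    return json.dumps({
        "task_id": str(uuid.uuid4()),
        "target": target_agent,
        "task": task,
        "params": params,
    })

def build_task_response(request_json, result):
    # Echo the task_id back so the requester can correlate the reply.
    request = json.loads(request_json)
    return json.dumps({
        "task_id": request["task_id"],
        "status": "done",
        "result": result,
    })

request = build_task_request("research-agent", "summarize_findings",
                             {"since": "2024-01-01"})
response = build_task_response(request, {"items": 12})
```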
The Broker Architecture: Balancing Decentralization & Control
While a fully decentralized agent system is theoretically appealing, we found it lacked the necessary control and observability. A purely centralized system, on the other hand, became a single point of failure and a bottleneck. Our solution is a hybrid – a broker architecture with a lightweight central broker (MCP Server) augmented by direct communication where appropriate.
The MCP Server manages pub/sub and task delegation. However, certain agents – particularly those within the same domain – may establish direct, persistent connections for high-frequency, low-latency communication. For instance, agents within 'CoreOps' performing cyber-physical reverse engineering often share intermediate results directly to speed up analysis. This is a conscious tradeoff between architectural purity and performance.
Currently, we have 47 services distributed across 23 ports (all secured with Tailscale HTTPS and WebAuthn/Face ID authentication). The MCP Server, running on port 8081, manages communication for the vast majority of interactions. Monitoring shows it currently handles approximately 1500 messages per minute with an average latency of 20ms. We’ve designed it to be horizontally scalable, though we haven't yet needed to deploy multiple instances.
Provenance and Security
Given the sensitive nature of our work, especially within the 'CoreOps' domain, maintaining provenance is critical. All messages exchanged through the MCP Server are cryptographically signed, with SHA-256 as the underlying digest, ensuring authenticity and integrity. This also forms a crucial component of our overall data lineage tracking.
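One common way to implement such signing is an HMAC over the canonicalized message body; the sketch below shows the idea with a shared key (key management and the exact scheme ARKONA uses differ and are not shown):

```python
import hashlib
import hmac
import json

SHARED_KEY = b"example-key"  # illustrative; real keys come from a secret store

def sign(message):
    # Canonicalize (sorted keys) so signer and verifier hash identical bytes.
    body = json.dumps(message, sort_keys=True).encode()
    tag = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return {"body": message, "sig": tag}

def verify(envelope):
    body = json.dumps(envelope["body"], sort_keys=True).encode()
    expected = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, envelope["sig"])

envelope = sign({"topic": "vulnerability_detected",
                 "payload": {"severity": "high"}})
```

Any tampering with the body invalidates the tag, which is exactly the integrity property the lineage tracking depends on.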
COMET Framework Integration
Our COMET (AI Governance) service leverages this communication architecture to implement our 7-step human↔AI delegation framework, based on IEEE and NIST standards. The framework defines levels of autonomy and requires explicit approval workflows for critical decisions. The MCP Server logs all task delegations and responses, providing an audit trail for COMET to assess compliance and identify potential risks. The fact-checking component of our 5-agent newsroom editorial pipeline similarly relies on these communication logs.
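The audit trail itself can be viewed as an append-only log of delegation events that COMET replays to flag tasks lacking an explicit approval step. A toy sketch (the event fields and action names are assumptions for illustration, not COMET's actual schema):

```python
from datetime import datetime, timezone

audit_log = []  # in production this is durable, append-only storage

def log_delegation(task_id, source, target, action):
    audit_log.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "task_id": task_id,
        "source": source,
        "target": target,
        "action": action,  # e.g. "requested", "approved", "completed"
    })

def unapproved_tasks(log):
    # Flag task IDs that completed without an explicit approval event.
    approved = {e["task_id"] for e in log if e["action"] == "approved"}
    completed = {e["task_id"] for e in log if e["action"] == "completed"}
    return completed - approved

log_delegation("t-1", "DevOps", "research-agent", "requested")
log_delegation("t-1", "COMET", "research-agent", "approved")
log_delegation("t-1", "research-agent", "DevOps", "completed")
log_delegation("t-2", "DevOps", "research-agent", "requested")
log_delegation("t-2", "research-agent", "DevOps", "completed")
```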
Lessons Learned & Ongoing Work
One key takeaway from building this system is the importance of balancing architectural elegance with pragmatic considerations. While fully decentralized architectures sound great in theory, they often fall short in practice due to complexity and lack of control. The hybrid broker architecture allows us to achieve both scalability and observability.
We're currently exploring ways to optimize our MuXD hybrid LLM router (Ollama local + Claude cloud) to further reduce latency and token usage. We’re also investigating more sophisticated routing algorithms for the MCP Server, potentially incorporating machine learning to dynamically adjust routes based on agent load and expertise. With 234 commits in the last 7 days, this is a rapidly evolving system, and we’re always learning and adapting.
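As a rough illustration of the hybrid-routing idea (not MuXD's actual policy; the threshold and criteria here are invented), such a router prefers the local model and escalates to the cloud only when a request exceeds what local can handle:

```python
def route(prompt, local_max_tokens=2048, needs_tools=False):
    """Toy routing policy: prefer the local model, escalate to cloud when
    the prompt is long or needs capabilities the local model lacks."""
    est_tokens = len(prompt) // 4  # crude characters-to-tokens estimate
    if needs_tools or est_tokens > local_max_tokens:
        return "claude-cloud"
    return "ollama-local"
```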