Building an AI Governance Taxonomy: Mapping 816 Task Definitions to Human-AI Delegation Levels

The core challenge in scaling an autonomous multi-agent system like ARKONA isn’t just building the agents themselves, but defining the boundaries of their authority. With 26 agents currently operating on a battle rhythm, coordinating research, editorial processes, and system monitoring, we’ve spent the last six months developing a rigorous task taxonomy and mapping it to a 7-step human-AI delegation framework. This goes beyond simple automation; it's about *governance* – ensuring responsible AI operation within a complex cyber-physical ecosystem. We're using this framework not just for ARKONA's internal operations, but as a foundational element for COMET, our AI governance service running on port 8087, which we envision as a building block for wider adoption.

The Need for a Granular Taxonomy

Initially, we approached delegation at a high level: “Agent X handles data ingestion,” or “Agent Y drafts initial reports.” This proved insufficient. The variability *within* those tasks was immense. Data ingestion encompasses everything from passively monitoring Tailscale connection states (port 8001) to actively querying the CIPHER hardware RE pipeline (port 8015) for new Ghidra project artifacts. A “draft report” could range from a simple summary of network traffic to a complex risk assessment leveraging our NIST 800-30 grounded risk evaluation engine.

To address this, we started by meticulously documenting all observable tasks performed by our agents. This resulted in an initial list of 816 distinct task definitions, each detailing inputs, outputs, dependencies, and estimated complexity. We treated this as a systems engineering problem, not just an AI one – think of it as functional decomposition, but with the added layer of determining appropriate human oversight.
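To make the functional-decomposition idea concrete, a taxonomy entry can be sketched as a small record. This is a minimal illustration, not ARKONA's actual schema; the field names and the 1-to-5 complexity scale are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TaskDefinition:
    """One entry in the task taxonomy (field names are illustrative)."""
    task_id: str
    description: str
    inputs: list[str]
    outputs: list[str]
    dependencies: list[str] = field(default_factory=list)
    complexity: int = 1  # assumed scale: 1 (trivial) to 5 (hard)

# One of the 816 entries, reconstructed from the example used later in the post
taxonomy = [
    TaskDefinition(
        task_id="REOPS-042",
        description="Generate daily system health report",
        inputs=["start_time", "end_time"],
        outputs=["report_text"],
        dependencies=["system-metrics-collector", "log-aggregator"],
        complexity=3,
    ),
]
```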

The 7-Step Delegation Framework (COMET)

COMET, our AI governance service, is built on a framework inspired by IEEE and NIST guidance on AI risk management. The seven steps, loosely modeled on MITRE ATT&CK’s approach to threat modeling, define levels of human-AI interaction:

  1. Define: Human specifies the task goal, constraints, and success criteria.
  2. Monitor: AI executes the task, and a human actively monitors progress.
  3. Review: AI completes the task, and a human reviews the output before any action is taken.
  4. Approve: AI proposes an action, and a human provides final approval.
  5. Delegate: AI autonomously executes the task, with pre-defined fallback mechanisms.
  6. Supervise: AI executes a task series; human provides high-level oversight and intervenes only on exception.
  7. Autonomy: AI operates fully independently, continuously learning and adapting (reserved for the most mature agents and well-defined tasks).

Each task within our taxonomy is then assigned to one of these levels. For example: generating a daily system health report (REOps agent, port 8009) is currently at ‘Review’ – the agent drafts the report, but a human editor within our 5-agent newsroom editorial pipeline performs fact-checking and ensures clarity before publication. Conversely, passively monitoring Tailscale HTTPS connections (CoreOps agent, port 8001) is at ‘Delegate’ – the agent alerts on anomalies, but doesn't require human intervention for normal operation.
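The seven levels and the two example assignments above can be sketched as an ordered enum plus a lookup table. This is a simplified illustration; the task-to-level mapping and the `needs_human_gate` helper are assumptions, and the `COREOPS-001` identifier is hypothetical.

```python
from enum import IntEnum

class DelegationLevel(IntEnum):
    """The 7-step COMET framework, ordered by increasing AI autonomy."""
    DEFINE = 1
    MONITOR = 2
    REVIEW = 3
    APPROVE = 4
    DELEGATE = 5
    SUPERVISE = 6
    AUTONOMY = 7

# Example assignments from the text (COREOPS-001 is an illustrative task ID)
assignments = {
    "REOPS-042": DelegationLevel.REVIEW,      # health report: human review before publication
    "COREOPS-001": DelegationLevel.DELEGATE,  # Tailscale monitoring: autonomous, alerts on anomalies
}

def needs_human_gate(level: DelegationLevel) -> bool:
    """Levels 1-4 all place a human in the loop before any action is taken."""
    return level <= DelegationLevel.APPROVE
```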

Technical Implementation: Task Metadata and Agent Contracts

This taxonomy isn't just a document; it's deeply integrated into our system architecture. We use JSON Schema to define task metadata, which includes:

{
  "task_id": "REOPS-042",
  "description": "Generate daily system health report",
  "agent_id": "reops-agent-01",
  "delegation_level": "Review",
  "input_schema": {
    "type": "object",
    "properties": {
      "start_time": { "type": "string", "format": "date-time" },
      "end_time": { "type": "string", "format": "date-time" }
    }
  },
  "output_schema": { "type": "string" },
  "dependencies": ["system-metrics-collector", "log-aggregator"]
}

This metadata is crucial for two reasons. First, it allows the inter-agent communication broker (MCP server, port 8006) to route tasks to the appropriate agent. Second, it's used by COMET to enforce delegation policies: we leverage the broker's pub/sub capabilities to intercept tasks based on their `delegation_level`. For example, tasks at 'Review' are automatically routed to a human-in-the-loop workflow via a dedicated queue.
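The routing decision itself can be sketched as a pure function over the task metadata. The real broker (MCP server, port 8006) does this via pub/sub interception; the queue names below are assumptions for illustration.

```python
# Delegation levels whose output must pass through a human before action
HUMAN_GATED = {"Define", "Monitor", "Review", "Approve"}

def route(task: dict) -> str:
    """Return the queue a task should be published to, based on its metadata."""
    if task["delegation_level"] in HUMAN_GATED:
        return "human-in-the-loop"       # e.g. the editorial review queue (name assumed)
    return f"agent/{task['agent_id']}"   # autonomous path, straight to the owning agent

task = {
    "task_id": "REOPS-042",
    "agent_id": "reops-agent-01",
    "delegation_level": "Review",
}
```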

We also define "agent contracts" – formal specifications of each agent’s capabilities and limitations. These contracts, also encoded in JSON schema, are used to verify that an agent is qualified to handle a particular task at a given delegation level. This helps prevent unexpected behavior and ensures accountability.
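A contract check of this kind can be sketched as follows. The contract fields (`capabilities`, `max_delegation_level`) are hypothetical names, not ARKONA's actual contract schema; the idea is simply that an agent may only take a task whose delegation level does not exceed its contracted ceiling.

```python
# Delegation levels in ascending order of autonomy
LEVELS = ["Define", "Monitor", "Review", "Approve", "Delegate", "Supervise", "Autonomy"]

# Hypothetical agent contract (field names are illustrative)
contract = {
    "agent_id": "reops-agent-01",
    "capabilities": ["report-generation", "metrics-query"],
    "max_delegation_level": "Review",  # not yet trusted beyond human review
}

def qualified(contract: dict, capability: str, level: str) -> bool:
    """An agent may handle a task only if it declares the capability and the
    requested delegation level does not exceed its contract ceiling."""
    return (capability in contract["capabilities"]
            and LEVELS.index(level) <= LEVELS.index(contract["max_delegation_level"]))
```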

MuXD and Token Savings Optimization

A key constraint in our AI governance design is cost. We're heavily reliant on MuXD, our hybrid LLM router. Tasks requiring complex reasoning or creative writing are routed to Claude (cloud), while simpler tasks (e.g., data filtering, summarization) are handled locally by one of our five Ollama models (running on the dual Tesla P40 GPUs). The delegation level directly impacts LLM selection. 'Autonomy' and 'Supervise' tasks are optimized for local execution whenever possible, minimizing cloud API costs. We’ve seen up to a 30% reduction in token usage by strategically routing tasks based on delegation level and inherent complexity.
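The routing policy described above can be sketched as a small selection function. This is a simplification of MuXD, with assumed model labels and complexity thresholds: autonomous tiers get a higher ceiling for local execution, biasing them toward the free local models.

```python
CLOUD = "claude"        # complex reasoning / creative writing
LOCAL = "ollama-local"  # stands in for one of the five local Ollama models

def select_backend(delegation_level: str, complexity: int) -> str:
    """Pick an LLM backend from delegation level and estimated complexity.
    Thresholds are illustrative, not MuXD's actual policy."""
    local_ceiling = 4 if delegation_level in ("Supervise", "Autonomy") else 2
    return LOCAL if complexity <= local_ceiling else CLOUD
```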

Challenges and Lessons Learned

This hasn't been without its challenges. Maintaining a taxonomy of 816 tasks is an ongoing effort – the system evolves, and new tasks emerge constantly. We’ve implemented automated tools to assist with task discovery and categorization, but human review remains essential.

We also underestimated the complexity of defining “success criteria” for each task. What constitutes a “good” system health report? How do we measure the accuracy of a risk assessment? These questions require careful consideration and often necessitate collaboration between AI engineers and domain experts.

The biggest lesson learned is that AI governance isn’t just about *controlling* AI; it’s about *enabling* responsible innovation. By providing a clear framework for human-AI delegation, we’ve not only reduced risk but also accelerated our development velocity. With 236 commits in the last 7 days, the team is clearly empowered by the structure COMET provides.

Ultimately, this taxonomy and framework aren't just for ARKONA. Our goal is to contribute to the broader conversation around AI governance and to provide a practical, scalable solution for building trustworthy AI systems.